-
Data-Efficient Psychiatric Disorder Detection via Self-supervised Learning on Frequency-enhanced Brain Networks
Authors:
Mujie Liu,
Mengchu Zhu,
Qichao Dong,
Ting Dang,
Jiangang Ma,
Jing Ren,
Feng Xia
Abstract:
Psychiatric disorders involve complex neural activity changes, with functional magnetic resonance imaging (fMRI) data serving as key diagnostic evidence. However, data scarcity and the diverse nature of fMRI information pose significant challenges. While graph-based self-supervised learning (SSL) methods have shown promise in brain network analysis, they primarily focus on time-domain representations, often overlooking the rich information embedded in the frequency domain. To overcome these limitations, we propose Frequency-Enhanced Network (FENet), a novel SSL framework specially designed for fMRI data that integrates time-domain and frequency-domain information to improve psychiatric disorder detection in small-sample datasets. FENet constructs multi-view brain networks based on the inherent properties of fMRI data, explicitly incorporating frequency information into the learning process of representation. Additionally, it employs domain-specific encoders to capture temporal-spectral characteristics, including an efficient frequency-domain encoder that highlights disease-relevant frequency features. Finally, FENet introduces a domain consistency-guided learning objective, which balances the utilization of diverse information and generates frequency-enhanced brain graph representations. Experiments on two real-world medical datasets demonstrate that FENet outperforms state-of-the-art methods while maintaining strong performance in minimal data conditions. Furthermore, we analyze the correlation between various frequency-domain features and psychiatric disorders, emphasizing the critical role of high-frequency information in disorder detection.
Submitted 4 September, 2025;
originally announced September 2025.
-
DCT-MARL: A Dynamic Communication Topology-Based MARL Algorithm for Connected Vehicle Platoon Control
Authors:
Yaqi Xu,
Yan Shi,
Jin Tian,
Fanzeng Xia,
Tongxin Li,
Shanzhi Chen,
Yuming Ge
Abstract:
With the rapid advancement of vehicular communication facilities and autonomous driving technologies, connected vehicle platooning has emerged as a promising approach to improve traffic efficiency and driving safety. Reliable Vehicle-to-Vehicle (V2V) communication is critical to achieving efficient cooperative control. However, in the real-world traffic environment, V2V communication may suffer from time-varying delay and packet loss, leading to degraded control performance and even safety risks. To mitigate the adverse effects of non-ideal communication, this paper proposes a Dynamic Communication Topology based Multi-Agent Reinforcement Learning (DCT-MARL) algorithm for robust cooperative platoon control. Specifically, the state space is augmented with historical control action and delay to enhance robustness against communication delay. To mitigate the impact of packet loss, a multi-key gated communication mechanism is introduced, which dynamically adjusts the communication topology based on the correlation between vehicles and their current communication status. Simulation results demonstrate that the proposed DCT-MARL significantly outperforms state-of-the-art methods in terms of string stability and driving comfort, validating its superior robustness and effectiveness.
Submitted 20 August, 2025; v1 submitted 18 August, 2025;
originally announced August 2025.
-
CovertAuth: Joint Covert Communication and Authentication in MmWave Systems
Authors:
Yulin Teng,
Keshuang Han,
Pinchang Zhang,
Xiaohong Jiang,
Yulong Shen,
Fu Xiao
Abstract:
Beam alignment (BA) is a crucial process in millimeter-wave (mmWave) communications, enabling precise directional transmission and efficient link establishment. However, due to characteristics like omnidirectional exposure and the broadcast nature of the BA phase, it is particularly vulnerable to eavesdropping and identity impersonation attacks. To this end, this paper proposes a novel secure framework named CovertAuth, designed to enhance the security of the BA phase against such attacks. In particular, to combat eavesdropping attacks, the closed-form expressions of successful BA probability and covert transmission rate are first derived. Then, a covert communication problem aimed at jointly optimizing beam training budget and transmission power is formulated to maximize covert communication rate, subject to the covertness requirement. An alternating optimization algorithm combined with successive convex approximation is employed to iteratively achieve optimal results. To combat impersonation attacks, the mutual coupling effect of antenna array impairments is explored as a device feature to design a weighted-sum energy detector based physical layer authentication scheme. Moreover, theoretical models for authentication metrics like detection and false alarm probabilities are also provided to conduct performance analysis. Based on these models, an optimization problem is constructed to determine the optimal weight value that maximizes authentication accuracy. Finally, simulation results demonstrate that CovertAuth presents improved detection accuracy under the same covertness requirement compared to existing works.
Submitted 11 July, 2025;
originally announced July 2025.
-
Customizable ROI-Based Deep Image Compression
Authors:
Jian Jin,
Fanxin Xia,
Feng Ding,
Xinfeng Zhang,
Meiqin Liu,
Yao Zhao,
Weisi Lin,
Lili Meng
Abstract:
Region of Interest (ROI)-based image compression optimizes bit allocation by prioritizing ROI for higher-quality reconstruction. However, as the users (including human clients and downstream machine tasks) become more diverse, ROI-based image compression needs to be customizable to support various preferences. For example, different users may define distinct ROI or require different quality trade-offs between ROI and non-ROI. Existing ROI-based image compression schemes predefine the ROI, making it unchangeable, and lack effective mechanisms to balance reconstruction quality between ROI and non-ROI. This work proposes a paradigm for customizable ROI-based deep image compression. First, we develop a Text-controlled Mask Acquisition (TMA) module, which allows users to easily customize their ROI for compression by just inputting the corresponding semantic text. It makes the encoder controlled by text. Second, we design a Customizable Value Assign (CVA) mechanism, which masks the non-ROI with a changeable extent decided by users instead of a constant one to manage the reconstruction quality trade-off between ROI and non-ROI. Finally, we present a Latent Mask Attention (LMA) module, where the latent spatial prior of the mask and the latent Rate-Distortion Optimization (RDO) prior of the image are extracted and fused in the latent space, and further used to optimize the latent representation of the source image. Experimental results demonstrate that our proposed customizable ROI-based deep image compression paradigm effectively addresses the needs of customization for ROI definition and mask acquisition as well as the reconstruction quality trade-off management between the ROI and non-ROI.
Submitted 2 July, 2025; v1 submitted 30 June, 2025;
originally announced July 2025.
-
Pureformer-VC: Non-parallel Voice Conversion with Pure Stylized Transformer Blocks and Triplet Discriminative Training
Authors:
Wenhan Yao,
Fen Xiao,
Xiarun Chen,
Jia Liu,
YongQiang He,
Weiping Wen
Abstract:
As a foundational technology for intelligent human-computer interaction, voice conversion (VC) seeks to transform speech from any source timbre into any target timbre. Traditional voice conversion methods based on Generative Adversarial Networks (GANs) encounter significant challenges in precisely encoding diverse speech elements and effectively synthesising these elements into natural-sounding converted speech. To overcome these limitations, we introduce Pureformer-VC, an encoder-decoder framework that utilizes Conformer blocks to build a disentangled encoder and employs Zipformer blocks to create a style transfer decoder. We adopt a variational decoupled training approach to isolate speech components using a Variational Autoencoder (VAE), complemented by triplet discriminative training to enhance the speaker's discriminative capabilities. Furthermore, we incorporate the Attention Style Transfer Mechanism (ASTM) with Zipformer's shared weights to improve the style transfer performance in the decoder. We conducted experiments on two multi-speaker datasets. The experimental results demonstrate that the proposed model achieves comparable subjective evaluation scores while significantly enhancing objective metrics compared to existing approaches in many-to-many and many-to-one VC scenarios.
Submitted 9 June, 2025;
originally announced June 2025.
-
SPBA: Utilizing Speech Large Language Model for Backdoor Attacks on Speech Classification Models
Authors:
Wenhan Yao,
Fen Xiao,
Xiarun Chen,
Jia Liu,
YongQiang He,
Weiping Wen
Abstract:
Deep speech classification tasks, including keyword spotting and speaker verification, are vital in speech-based human-computer interaction. Recently, the security of these technologies has been revealed to be susceptible to backdoor attacks. Specifically, attackers use noisy disruption triggers and speech element triggers to produce poisoned speech samples that train models to become vulnerable. However, these methods typically create only a limited number of backdoors due to the inherent constraints of the trigger function. In this paper, we propose that speech backdoor attacks can strategically focus on speech elements such as timbre and emotion, leveraging the Speech Large Language Model (SLLM) to generate diverse triggers. Increasing the number of triggers may disproportionately elevate the poisoning rate, resulting in higher attack costs and a lower success rate per trigger. We introduce the Multiple Gradient Descent Algorithm (MGDA) as a mitigation strategy to address this challenge. The proposed attack is called the Speech Prompt Backdoor Attack (SPBA). Building on this foundation, we conducted attack experiments on two speech classification tasks, demonstrating that SPBA shows significant trigger effectiveness and achieves exceptional performance in attack metrics.
Submitted 9 June, 2025;
originally announced June 2025.
-
Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection
Authors:
Falih Gozi Febrinanto,
Kristen Moore,
Chandra Thapa,
Jiangang Ma,
Vidya Saikrishna,
Feng Xia
Abstract:
The performance of existing audio deepfake detection frameworks degrades when confronted with new deepfake attacks. Rehearsal-based continual learning (CL), which updates models using a limited set of old data samples, helps preserve prior knowledge while incorporating new information. However, existing rehearsal techniques do not effectively capture the diversity of audio characteristics, introducing bias and increasing the risk of forgetting. To address this challenge, we propose Rehearsal with Auxiliary-Informed Sampling (RAIS), a rehearsal-based CL approach for audio deepfake detection. RAIS employs a label generation network to produce auxiliary labels, guiding diverse sample selection for the memory buffer. Extensive experiments show RAIS outperforms state-of-the-art methods, achieving an average Equal Error Rate (EER) of 1.953% across five experiences. The code is available at: https://github.com/falihgoz/RAIS.
Submitted 30 May, 2025;
originally announced May 2025.
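The RAIS abstract above reports results as an Equal Error Rate (EER), the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how an EER can be estimated from two lists of detector scores; it is not taken from the RAIS repository. The function name, the convention that higher scores mean "bona fide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores.

    Assumption for this sketch: a higher score means 'more likely bona fide'.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    # One extra threshold above every score, so 'reject everything' is also tried.
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a bona fide sample scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed sample scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy scores, not data from the paper.
toy_bonafide = [0.9, 0.6, 0.4, 0.8]
toy_spoof = [0.1, 0.5, 0.7, 0.2]
print(f"EER on toy scores: {equal_error_rate(toy_bonafide, toy_spoof):.2%}")
```

With finitely many scores the two error rates rarely cross exactly, so this sketch returns the midpoint at the closest threshold; evaluation toolkits often interpolate instead.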
-
Aneumo: A Large-Scale Multimodal Aneurysm Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks
Authors:
Xigui Li,
Yuanye Zhou,
Feiyang Xiao,
Xin Guo,
Chen Jiang,
Tan Pan,
Xingmeng Zhang,
Cenyu Liu,
Zeyun Miao,
Jianchao Ge,
Xiansheng Wang,
Qimeng Wang,
Yichi Zhang,
Wenbo Zhang,
Fengping Zhu,
Limei Han,
Yuan Qi,
Chensen Lin,
Yuan Cheng
Abstract:
Intracranial aneurysms (IAs) are serious cerebrovascular lesions found in approximately 5% of the general population. Their rupture may lead to high mortality. Current methods for assessing IA risk focus on morphological and patient-specific factors, but the hemodynamic influences on IA development and rupture remain unclear. While accurate for hemodynamic studies, conventional computational fluid dynamics (CFD) methods are computationally intensive, hindering their deployment in large-scale or real-time clinical applications. To address this challenge, we curated a large-scale, high-fidelity aneurysm CFD dataset to facilitate the development of efficient machine learning algorithms for such applications. Based on 427 real aneurysm geometries, we synthesized 10,660 3D shapes via controlled deformation to simulate aneurysm evolution. The authenticity of these synthetic shapes was confirmed by neurosurgeons. CFD computations were performed on each shape under eight steady-state mass flow conditions, generating a total of 85,280 blood flow dynamics data covering key parameters. Furthermore, the dataset includes segmentation masks, which can support tasks that use images, point clouds or other multimodal data as input. Additionally, we introduced a benchmark for estimating flow parameters to assess current modeling methods. This dataset aims to advance aneurysm research and promote data-driven approaches in biofluids, biomedical engineering, and clinical risk assessment. The code and dataset are available at: https://github.com/Xigui-Li/Aneumo.
Submitted 19 May, 2025;
originally announced May 2025.
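The dataset size quoted in the Aneumo abstract above follows directly from the other figures it gives. A short check, using only numbers stated in the abstract (the variable names are illustrative):

```python
n_real_geometries = 427   # real aneurysm geometries, from the abstract
n_shapes = 10_660         # synthetic 3D shapes, from the abstract
n_flow_conditions = 8     # steady-state mass flow conditions, from the abstract

# One CFD computation per shape per flow condition.
n_simulations = n_shapes * n_flow_conditions

# Average number of synthetic shapes derived from each real geometry.
shapes_per_geometry = n_shapes / n_real_geometries

print(n_simulations)
print(round(shapes_per_geometry, 2))
```

The product matches the 85,280 total reported in the abstract. The per-geometry average is just under 25 and is not a whole number, so the shapes cannot be spread evenly over the 427 geometries; the abstract does not say how they are distributed.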
-
Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms
Authors:
Junchi Zhou,
Haozhou Wang,
Yoichiro Kato,
Tejasri Nampally,
P. Rajalakshmi,
M. Balram,
Keisuke Katsura,
Hao Lu,
Yue Mu,
Wanneng Yang,
Yangmingrui Gao,
Feng Xiao,
Hongtao Chen,
Yuhao Chen,
Wenjuan Li,
Jingwen Wang,
Fenghua Yu,
Jian Zhou,
Wensheng Wang,
Xiaochun Hu,
Yuanzhu Yang,
Yanfeng Ding,
Wei Guo,
Shouyang Liu
Abstract:
Developing computer vision-based rice phenotyping techniques is crucial for precision field management and accelerating breeding, thereby continuously advancing rice production. Among phenotyping tasks, distinguishing image components is a key prerequisite for characterizing plant growth and development at the organ scale, enabling deeper insights into eco-physiological processes. However, due to the fine structure of rice organs and complex illumination within the canopy, this task remains highly challenging, underscoring the need for a high-quality training dataset. Such datasets are scarce, both due to a lack of large, representative collections of rice field images and the time-intensive nature of annotation. To address this gap, we established the first comprehensive multi-class rice semantic segmentation dataset, RiceSEG. We gathered nearly 50,000 high-resolution, ground-based images from five major rice-growing countries (China, Japan, India, the Philippines, and Tanzania), encompassing over 6,000 genotypes across all growth stages. From these original images, 3,078 representative samples were selected and annotated with six classes (background, green vegetation, senescent vegetation, panicle, weeds, and duckweed) to form the RiceSEG dataset. Notably, the sub-dataset from China spans all major genotypes and rice-growing environments from the northeast to the south. Both state-of-the-art convolutional neural networks and transformer-based semantic segmentation models were used as baselines. While these models perform reasonably well in segmenting background and green vegetation, they face difficulties during the reproductive stage, when canopy structures are more complex and multiple classes are involved. These findings highlight the importance of our dataset for developing specialized segmentation models for rice and other crops.
Submitted 2 April, 2025;
originally announced April 2025.
-
Robust Control of General Linear Delay Systems under Dissipativity: Part I -- A KSD based Framework
Authors:
Qian Feng,
Wei Xing Zheng,
Xiaoyu Wang,
Feng Xiao
Abstract:
This paper introduces an effective framework for designing memoryless dissipative full-state feedbacks for general linear delay systems via the Krasovskiĭ functional (KF) approach, where an unlimited number of pointwise and general distributed delays (DDs) exists in the state, input and output. To handle the infinite dimensionality of DDs, we employ the Kronecker-Seuret Decomposition (KSD) which we recently proposed for analyzing matrix-valued functions in the context of delay systems. The KSD enables factorization or least-squares approximation of any number of $\mathcal{L}^2$ DD kernels from any number of DDs without introducing conservatism. This also facilitates the construction of a complete-type KF with flexible integral kernels, following from an application of a novel integral inequality derived from the least-squares principle. Our solution includes two theorems and an iterative algorithm to compute controller gains without relying on nonlinear solvers. A challenging numerical example, intractable for existing methods, underscores the efficacy of this approach.
Submitted 3 April, 2025; v1 submitted 31 March, 2025;
originally announced April 2025.
-
Multipath Component Power Delay Profile Based Joint Range and Doppler Estimation for AFDM-ISAC Systems
Authors:
Fangqing Xiao,
Zunqi Li,
Dirk Slock
Abstract:
Integrated Sensing and Communication (ISAC) systems combine sensing and communication functionalities within a unified framework, enhancing spectral efficiency and reducing costs by utilizing shared hardware components. This paper investigates multipath component power delay profile (MPCPDP)-based joint range and Doppler estimation for Affine Frequency Division Multiplexing (AFDM)-ISAC systems. The path resolvability of the equivalent channel in the AFDM system allows the recognition of Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS) paths within a single pilot symbol in fast time-varying channels. We develop a joint estimation model that leverages multipath Doppler shift and delay information under the AFDM waveform. Utilizing the MPCPDP, we propose a novel ranging method that exploits the range-dependent magnitude of the MPCPDP across its delay spread by constructing a Nakagami-m statistical fading model for MPC channel fading and correlating the distribution parameters with propagation distance in AFDM systems. This method eliminates the need for additional time synchronization or extra hardware. We also transform the nonlinear Doppler estimation problem into a bilinear estimation problem using a first-order Taylor expansion. Moreover, we introduce the Expectation Maximization algorithm to estimate the hyperparameters and leverage the Expectation Consistent algorithm to cope with high-dimensional integration challenges. Extensive numerical simulations demonstrate the effectiveness of our MPCPDP-based joint range and Doppler estimation in ISAC systems.
Submitted 13 March, 2025;
originally announced March 2025.
-
Utilizing 3D Fast Spin Echo Anatomical Imaging to Reduce the Number of Contrast Preparations in $T_{1\rho}$ Quantification of Knee Cartilage Using Learning-Based Methods
Authors:
Junru Zhong,
Chaoxing Huang,
Ziqiang Yu,
Fan Xiao,
Siyue Li,
Tim-Yun Michael Ong,
Ki-Wai Kevin Ho,
Queenie Chan,
James F. Griffith,
Weitian Chen
Abstract:
Purpose: To propose and evaluate an accelerated $T_{1\rho}$ quantification method that combines $T_{1\rho}$-weighted fast spin echo (FSE) images and proton density (PD)-weighted anatomical FSE images, leveraging deep learning models for $T_{1\rho}$ mapping. The goal is to reduce scan time and facilitate integration into routine clinical workflows for osteoarthritis (OA) assessment. Methods: This retrospective study utilized MRI data from 40 participants (30 OA patients and 10 healthy volunteers). A volume of PD-weighted anatomical FSE images and a volume of $T_{1\rho}$-weighted images acquired at a non-zero spin-lock time were used as input to train deep learning models, including a 2D U-Net and a multi-layer perceptron (MLP). $T_{1\rho}$ maps generated by these models were compared with ground truth maps derived from a traditional non-linear least squares (NLLS) fitting method using four $T_{1\rho}$-weighted images. Evaluation metrics included mean absolute error (MAE), mean absolute percentage error (MAPE), regional error (RE), and regional percentage error (RPE). Results: Deep learning models achieved RPEs below 5% across all evaluated scenarios, outperforming NLLS methods, especially in low signal-to-noise conditions. The best results were obtained using the 2D U-Net, which effectively leveraged spatial information for accurate $T_{1\rho}$ fitting. The proposed method demonstrated compatibility with shorter TSLs, alleviating RF hardware and specific absorption rate (SAR) limitations. Conclusion: The proposed approach enables efficient $T_{1\rho}$ mapping using PD-weighted anatomical images, reducing scan time while maintaining clinical standards. This method has the potential to facilitate the integration of quantitative MRI techniques into routine clinical practice, benefiting OA diagnosis and monitoring.
Submitted 13 February, 2025;
originally announced February 2025.
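The abstract above compares learned $T_{1\rho}$ maps against a conventional fit over four $T_{1\rho}$-weighted images. As background, such fits commonly assume the mono-exponential decay $S(\mathrm{TSL}) = S_0 \exp(-\mathrm{TSL}/T_{1\rho})$. The sketch below fits that model for a single voxel. Note that it swaps in a log-linear least-squares fit for the NLLS method named in the abstract, purely to stay dependency-free. The function name, the four spin-lock times, and the synthetic signal values are hypothetical and do not come from the paper.

```python
import math


def fit_t1rho_loglinear(tsls_ms, signals):
    """Fit S(TSL) = S0 * exp(-TSL / T1rho) for one voxel via a straight-line fit.

    Taking logs gives ln S = ln S0 - TSL / T1rho, so the slope of ln S
    against TSL is -1 / T1rho. This is a simplification of NLLS fitting.
    """
    ys = [math.log(s) for s in signals]
    n = len(tsls_ms)
    mean_x = sum(tsls_ms) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in tsls_ms)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tsls_ms, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -1.0 / slope  # (S0, T1rho in ms)


# Hypothetical acquisition: four spin-lock times in milliseconds.
tsls = [0.0, 10.0, 30.0, 50.0]
# Noise-free synthetic voxel with S0 = 100 and T1rho = 40 ms.
voxel = [100.0 * math.exp(-t / 40.0) for t in tsls]

s0_hat, t1rho_hat = fit_t1rho_loglinear(tsls, voxel)
print(f"S0 = {s0_hat:.1f}, T1rho = {t1rho_hat:.1f} ms")
```

A log-linear fit weights low-signal points differently from NLLS and tends to degrade more under noise. Poor behaviour of per-voxel fitting in noisy data is consistent with the abstract's remark that the learned models did better in low signal-to-noise conditions.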
-
Vision Graph Non-Contrastive Learning for Audio Deepfake Detection with Limited Labels
Authors:
Falih Gozi Febrinanto,
Kristen Moore,
Chandra Thapa,
Jiangang Ma,
Vidya Saikrishna,
Feng Xia
Abstract:
Recent advancements in audio deepfake detection have leveraged graph neural networks (GNNs) to model frequency and temporal interdependencies in audio data, effectively identifying deepfake artifacts. However, the reliance of GNN-based methods on substantial labeled data for graph construction and robust performance limits their applicability in scenarios with limited labeled data. Although vast amounts of audio data exist, the process of labeling samples as genuine or fake remains labor-intensive and costly. To address this challenge, we propose SIGNL (Spatio-temporal vIsion Graph Non-contrastive Learning), a novel framework that maintains high GNN performance in low-label settings. SIGNL constructs spatio-temporal graphs by representing patches from the audio's visual spectrogram as nodes. These graph structures are modeled using vision graph convolutional (GC) encoders pre-trained through graph non-contrastive learning, a label-free approach that maximizes the similarity between positive pairs. The pre-trained encoders are then fine-tuned for audio deepfake detection, reducing reliance on labeled data. Experiments demonstrate that SIGNL outperforms state-of-the-art baselines across multiple audio deepfake detection datasets, achieving the lowest Equal Error Rate (EER) with as little as 5% labeled data. Additionally, SIGNL exhibits strong cross-domain generalization, achieving the lowest EER in evaluations involving diverse attack types and languages in the In-The-Wild dataset.
Submitted 8 January, 2025;
originally announced January 2025.
-
Disentangling Hierarchical Features for Anomalous Sound Detection Under Domain Shift
Authors:
Jian Guan,
Jiantong Tian,
Qiaoxi Zhu,
Feiyang Xiao,
Hejing Zhang,
Xubo Liu
Abstract:
Anomalous sound detection (ASD) encounters difficulties with domain shift, where the sounds of machines in target domains differ significantly from those in source domains due to varying operating conditions. Existing methods typically employ domain classifiers to enhance detection performance, but they often overlook the influence of domain-unrelated information. This oversight can hinder the model's ability to clearly distinguish between domains, thereby weakening its capacity to differentiate normal from abnormal sounds. In this paper, we propose a Gradient Reversal-based Hierarchical feature Disentanglement (GRHD) method to address the above challenge. GRHD uses gradient reversal to separate domain-related features from domain-unrelated ones, resulting in more robust feature representations. Additionally, the method employs a hierarchical structure to guide the learning of fine-grained, domain-specific features by leveraging available metadata, such as section IDs and machine sound attributes. Experimental results on the DCASE 2022 Challenge Task 2 dataset demonstrate that the proposed method significantly improves ASD performance under domain shift.
Submitted 2 January, 2025;
originally announced January 2025.
-
Spectral-Temporal Fusion Representation for Person-in-Bed Detection
Authors:
Xuefeng Yang,
Shiheng Zhang,
Jian Guan,
Feiyang Xiao,
Wei Lu,
Qiaoxi Zhu
Abstract:
This study is based on the ICASSP 2025 Signal Processing Grand Challenge's Accelerometer-Based Person-in-Bed Detection Challenge, which aims to determine bed occupancy using accelerometer signals. The task is divided into two tracks: "in bed" and "not in bed" segmented detection, and streaming detection, facing challenges such as individual differences, posture variations, and external disturbances. We propose a spectral-temporal fusion-based feature representation method with mixup data augmentation, and adopt Intersection over Union (IoU) loss to optimize detection accuracy. In the two tracks, our method achieved outstanding results of 100.00% and 95.55% in detection scores, securing first place and third place, respectively.
Submitted 26 December, 2024;
originally announced December 2024.
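The person-in-bed abstract above names two generic ingredients, mixup data augmentation and an Intersection over Union (IoU) loss, without giving their form. The sketch below shows one common formulation of each for per-sample binary labels ("in bed" = 1). It is an illustration under assumptions, not the authors' implementation: the function names, the soft-IoU variant, and the fixed mixing weight are all hypothetical choices.

```python
def soft_iou_loss(probs, labels, eps=1e-7):
    """1 - soft IoU between predicted probabilities and binary labels.

    Intersection is sum(p * y); union is sum(p + y - p * y).
    The small eps keeps the ratio defined when both sequences are all zero.
    """
    inter = sum(p * y for p, y in zip(probs, labels))
    union = sum(p + y - p * y for p, y in zip(probs, labels))
    return 1.0 - (inter + eps) / (union + eps)


def mixup(x1, y1, x2, y2, lam):
    """Blend two examples and their labels with weight lam in [0, 1].

    In practice lam is usually drawn from a Beta distribution; it is passed
    in explicitly here so the example stays deterministic.
    """
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = lam * y1 + (1.0 - lam) * y2
    return x, y


# Hypothetical five-step window: prediction overlaps the truth on 2 of 4 active steps.
pred = [0.0, 1.0, 1.0, 1.0, 0.0]
truth = [0.0, 0.0, 1.0, 1.0, 1.0]
print(f"IoU loss: {soft_iou_loss(pred, truth):.3f}")

# Hypothetical two-sample accelerometer snippets with labels 0 and 1.
mixed_x, mixed_y = mixup([0.0, 2.0], 0.0, [2.0, 0.0], 1.0, lam=0.25)
print(mixed_x, mixed_y)
```

An IoU-style loss rewards overlap between predicted and true occupied segments directly, which fits a task scored on segment detection better than a purely per-sample loss would.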
-
Graph-Enhanced Dual-Stream Feature Fusion with Pre-Trained Model for Acoustic Traffic Monitoring
Authors:
Shitong Fan,
Feiyang Xiao,
Wenbo Wang,
Shuhan Qi,
Qiaoxi Zhu,
Wenwu Wang,
Jian Guan
Abstract:
Microphone array techniques are widely used in sound source localization and smart city acoustic-based traffic monitoring, but these applications face significant challenges due to the scarcity of labeled real-world traffic audio data and the complexity and diversity of application scenarios. The DCASE Challenge's Task 10 focuses on using multi-channel audio signals to count vehicles (cars or commercial vehicles) and identify their directions (left-to-right or vice versa). In this paper, we propose a graph-enhanced dual-stream feature fusion network (GEDF-Net) for acoustic traffic monitoring, which simultaneously considers vehicle type and direction to improve detection. We propose a graph-enhanced dual-stream feature fusion strategy which consists of a vehicle type feature extraction (VTFE) branch, a vehicle direction feature extraction (VDFE) branch, and a frame-level feature fusion module to combine the type and direction feature for enhanced performance. A pre-trained model (PANNs) is used in the VTFE branch to mitigate data scarcity and enhance the type features, followed by a graph attention mechanism to exploit temporal relationships and highlight important audio events within these features. The frame-level fusion of direction and type features enables fine-grained feature representation, resulting in better detection performance. Experiments demonstrate the effectiveness of our proposed method. GEDF-Net is our submission that achieved 1st place in the DCASE 2024 Challenge Task 10.
Submitted 26 December, 2024;
originally announced December 2024.
-
Attacking Voice Anonymization Systems with Augmented Feature and Speaker Identity Difference
Authors:
Yanzhe Zhang,
Zhonghao Bi,
Feiyang Xiao,
Xuefeng Yang,
Qiaoxi Zhu,
Jian Guan
Abstract:
This study focuses on the First VoicePrivacy Attacker Challenge within the ICASSP 2025 Signal Processing Grand Challenge, which aims to develop speaker verification systems capable of determining whether two anonymized speech signals are from the same speaker. However, differences between feature distributions of original and anonymized speech complicate this task. To address this challenge, we propose an attacker system that combines Data Augmentation enhanced feature representation and Speaker Identity Difference enhanced classifier to improve verification performance, termed DA-SID. Specifically, data augmentation strategies (i.e., data fusion and SpecAugment) are utilized to mitigate feature distribution gaps, while probabilistic linear discriminant analysis (PLDA) is employed to further enhance speaker identity difference. Our system significantly outperforms the baseline, demonstrating exceptional effectiveness and robustness against various voice anonymization systems, ultimately securing a top-5 ranking in the challenge.
Submitted 12 January, 2025; v1 submitted 26 December, 2024;
originally announced December 2024.
-
Optimizing Clustered Cell-Free Networking for Sum Ergodic Capacity Maximization with Joint Processing Constraint
Authors:
Funing Xia,
Junyuan Wang,
Lin Dai
Abstract:
Clustered cell-free networking has been considered as an effective scheme to trade off between the low complexity of current cellular networks and the superior performance of fully cooperative networks. With clustered cell-free networking, the wireless network is decomposed into a number of disjoint parallel operating subnetworks with joint processing adopted inside each subnetwork independently for intra-subnetwork interference mitigation. Different from the existing works that aim to maximize the number of subnetworks without considering the limited processing capability of base-stations (BSs), this paper investigates the clustered cell-free networking problem with the objective of maximizing the sum ergodic capacity while imposing a limit on the number of user equipments (UEs) in each subnetwork to constrain the joint processing complexity. By successfully transforming the combinatorial NP-hard clustered cell-free networking problem into an integer convex programming problem, the problem is solved by the branch-and-bound method. To further reduce the computational complexity, a bisection clustered cell-free networking (BC^2F-Net) algorithm is proposed to decompose the network hierarchically. Simulation results show that compared to the branch-and-bound based scheme, the proposed BC^2F-Net algorithm significantly reduces the computational complexity yet achieves nearly the same network decomposition result. Moreover, our BC^2F-Net algorithm achieves near-optimal performance and outperforms the state-of-the-art benchmarks with up to 25% capacity gain.
Submitted 18 November, 2024;
originally announced November 2024.
-
Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract
Authors:
Fan Xiao,
Junlin Hou,
Ruiwei Zhao,
Rui Feng,
Haidong Zou,
Lina Lu,
Yi Xu,
Juzhao Zhang
Abstract:
Diabetic retinopathy (DR) is a leading cause of blindness worldwide and a common complication of diabetes. As two different imaging tools for DR grading, color fundus photography (CFP) and infrared fundus photography (IFP) are highly-correlated and complementary in clinical applications. To the best of our knowledge, this is the first study that explores a novel multi-modal deep learning framework to fuse the information from CFP and IFP towards more accurate DR grading. Specifically, we construct a dual-stream architecture Cross-Fundus Transformer (CFT) to fuse the ViT-based features of two fundus image modalities. In particular, a meticulously engineered Cross-Fundus Attention (CFA) module is introduced to capture the correspondence between CFP and IFP images. Moreover, we adopt both the single-modality and multi-modality supervisions to maximize the overall performance for DR grading. Extensive experiments on a clinical dataset consisting of 1,713 pairs of multi-modal fundus images demonstrate the superiority of our proposed method. Our code will be released for public access.
Submitted 1 November, 2024;
originally announced November 2024.
-
Independent Feature Enhanced Crossmodal Fusion for Match-Mismatch Classification of Speech Stimulus and EEG Response
Authors:
Shitong Fan,
Wenbo Wang,
Feiyang Xiao,
Shiheng Zhang,
Qiaoxi Zhu,
Jian Guan
Abstract:
It is crucial for auditory attention decoding to classify matched and mismatched speech stimuli with corresponding EEG responses by exploring their relationship. However, existing methods often adopt two independent networks to encode speech stimulus and EEG response, which neglect the relationship between these signals from the two modalities. In this paper, we propose an independent feature enhanced crossmodal fusion model (IFE-CF) for match-mismatch classification, which leverages the fusion feature of the speech stimulus and the EEG response to achieve auditory EEG decoding. Specifically, our IFE-CF contains a crossmodal encoder to encode the speech stimulus and the EEG response with a two-branch structure connected via crossmodal attention mechanism in the encoding process, a multi-channel fusion module to fuse features of two modalities by aggregating the interaction feature obtained from the crossmodal encoder and the independent feature obtained from the speech stimulus and EEG response, and a predictor to give the matching result. In addition, the causal mask is introduced to consider the time delay of the speech-EEG pair in the crossmodal encoder, which further enhances the feature representation for match-mismatch classification. Experiments demonstrate our method's effectiveness with better classification accuracy, as compared with the baseline of the Auditory EEG Decoding Challenge 2023.
Submitted 19 October, 2024;
originally announced October 2024.
-
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
Authors:
Homanga Bharadhwaj,
Debidatta Dwibedi,
Abhinav Gupta,
Shubham Tulsiani,
Carl Doersch,
Ted Xiao,
Dhruv Shah,
Fei Xia,
Dorsa Sadigh,
Sean Kirmani
Abstract:
How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn't require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/
Submitted 24 September, 2024;
originally announced September 2024.
-
A Systematic Post-Processing Approach for Quantitative $T_{1ρ}$ Imaging of Knee Articular Cartilage
Authors:
Junru Zhong,
Yongcheng Yao,
Fan Xiao,
Tim-Yun Michael Ong,
Ki-Wai Kevin Ho,
Siyue Li,
Chaoxing Huang,
Queenie Chan,
James F. Griffith,
Weitian Chen
Abstract:
Objective: To establish an automated pipeline for post-processing of quantitative spin-lattice relaxation time constant in the rotating frame ($T_{1ρ}$) imaging of knee articular cartilage. Design: The proposed post-processing pipeline commences with an image standardisation procedure, followed by deep learning-based segmentation to generate cartilage masks. The articular cartilage is then automatically parcellated into 20 subregions, where $T_{1ρ}$ quantification is performed. The proposed pipeline was retrospectively validated on a dataset comprising knee $T_{1ρ}$ images of 10 healthy volunteers and 30 patients with knee osteoarthritis. Three experiments were conducted, namely an assessment of segmentation model performance (using Dice similarity coefficients, DSCs); an evaluation of the impact of standardisation; and a test of $T_{1ρ}$ quantification accuracy (using paired t-tests; root-mean-square deviations, RMSDs; and coefficients of variance of RMSDs, $CV_{RMSD}$). Statistical significance was set as p<0.05. Results: There was a substantial agreement between the subregional $T_{1ρ}$ quantification from the model-predicted masks and those from the manual segmentation labels. In patients, 17 of 20 subregions, and in healthy volunteers, 18 out of 20 subregions, demonstrated no significant difference between predicted and reference $T_{1ρ}$ quantifications. Average RMSDs were 0.79 ms for patients and 0.56 ms for healthy volunteers, while average $CV_{RMSD}$ were 1.97% and 1.38% for patients and healthy volunteers. Bland-Altman plots showed negligible bias across all subregions for patients and healthy volunteers. Conclusion: The proposed pipeline can perform automatic and reliable post-processing of quantitative $T_{1ρ}$ images of knee articular cartilage.
Submitted 19 September, 2024;
originally announced September 2024.
-
Symbiotic Sensing and Communication: Framework and Beamforming Design
Authors:
Fanghao Xia,
Zesong Fei,
Xinyi Wang,
Weijie Yuan,
Qingqing Wu,
Yuanwei Liu,
Tony Q. S. Quek
Abstract:
In this paper, we propose a novel symbiotic sensing and communication (SSAC) framework, comprising a base station (BS) and a passive sensing node. In particular, the BS transmits communication waveform to serve vehicle users (VUEs), while the sensing node is employed to execute sensing tasks based on the echoes in a bistatic manner, thereby avoiding the issue of self-interference. Besides the weak target of interest, the sensing node tracks VUEs and shares sensing results with BS to facilitate sensing-assisted beamforming. By considering both fully digital arrays and hybrid analog-digital (HAD) arrays, we investigate the beamforming design in the SSAC system. We first derive the Cramer-Rao lower bound (CRLB) of the two-dimensional angles of arrival estimation as the sensing metric. Next, we formulate an achievable sum rate maximization problem under the CRLB constraint, where the channel state information is reconstructed based on the sensing results. Then, we propose two penalty dual decomposition (PDD)-based alternating algorithms for fully digital and HAD arrays, respectively. Simulation results demonstrate that the proposed algorithms can achieve an outstanding data rate with effective localization capability for both VUEs and the weak target. In particular, the HAD beamforming design exhibits remarkable performance gain compared to conventional schemes, especially with fewer radio frequency chains.
Submitted 27 August, 2024;
originally announced August 2024.
-
A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining
Authors:
Feiyang Xiao,
Jian Guan,
Qiaoxi Zhu,
Xubo Liu,
Wenbo Wang,
Shuhan Qi,
Kejia Zhang,
Jianyuan Sun,
Wenwu Wang
Abstract:
Language-queried audio source separation (LASS) aims to separate an audio source guided by a text query, with the signal-to-distortion ratio (SDR)-based metrics being commonly used to objectively measure the quality of the separated audio. However, the SDR-based metrics require a reference signal, which is often difficult to obtain in real-world scenarios. In addition, with the SDR-based metrics, the content information of the text query is not considered effectively in LASS. This paper introduces a reference-free evaluation metric using a contrastive language-audio pretraining (CLAP) module, termed CLAPScore, which measures the semantic similarity between the separated audio and the text query. Unlike SDR, the proposed CLAPScore metric evaluates the quality of the separated audio based on the content information of the text query, without needing a reference signal. Experiments show that the CLAPScore provides an effective evaluation of the semantic relevance of the separated audio to the text query, as compared to the SDR metric, offering an alternative for the performance evaluation of LASS systems. The code for evaluation is publicly available.
Submitted 5 January, 2025; v1 submitted 5 July, 2024;
originally announced July 2024.
-
Revealing the Trade-off in ISAC Systems: The KL Divergence Perspective
Authors:
Zesong Fei,
Shuntian Tang,
Xinyi Wang,
Fanghao Xia,
Fan Liu,
J. Andrew Zhang
Abstract:
Integrated sensing and communication (ISAC) is regarded as a promising technique for 6G communication networks. In this letter, we investigate the Pareto bound of the ISAC system in terms of a unified Kullback-Leibler (KL) divergence performance metric. We first present the relationship between KL divergence and explicit ISAC performance metrics, i.e., demodulation error and probability of detection. Thereafter, we investigate the impact of constellation and beamforming design on the Pareto bound via deep learning and semi-definite relaxation (SDR) techniques. Simulation results show the trade-off between sensing and communication performance in terms of bit error rate (BER) and probability of detection under different parameter set-ups.
Submitted 17 May, 2024;
originally announced May 2024.
-
Millimeter Wave Radar-based Human Activity Recognition for Healthcare Monitoring Robot
Authors:
Zhanzhong Gu,
Xiangjian He,
Gengfa Fang,
Chengpei Xu,
Feng Xia,
Wenjing Jia
Abstract:
Healthcare monitoring is crucial, especially for the daily care of elderly individuals living alone. It can detect dangerous occurrences, such as falls, and provide timely alerts to save lives. Non-invasive millimeter wave (mmWave) radar-based healthcare monitoring systems using advanced human activity recognition (HAR) models have recently gained significant attention. However, they encounter challenges in handling sparse point clouds, achieving real-time continuous classification, and coping with limited monitoring ranges when statically mounted. To overcome these limitations, we propose RobHAR, a movable robot-mounted mmWave radar system with lightweight deep neural networks for real-time monitoring of human activities. Specifically, we first propose a sparse point cloud-based global embedding to learn the features of point clouds using the light-PointNet (LPN) backbone. Then, we learn the temporal pattern with a bidirectional lightweight LSTM model (BiLiLSTM). In addition, we implement a transition optimization strategy, integrating the Hidden Markov Model (HMM) with Connectionist Temporal Classification (CTC) to improve the accuracy and robustness of the continuous HAR. Our experiments on three datasets indicate that our method significantly outperforms the previous studies in both discrete and continuous HAR tasks. Finally, we deploy our system on a movable robot-mounted edge computing platform, achieving flexible healthcare monitoring in real-world scenarios.
Submitted 3 May, 2024;
originally announced May 2024.
-
Scalable second-order consensus of hierarchical groups
Authors:
Jiamin Wang,
Jian Liu,
Feng Xiao,
Ning Xi,
Yuanshi Zheng
Abstract:
Motivated by widespread dominance hierarchy, growth of group sizes, and feedback mechanisms in social species, we are devoted to exploring the scalable second-order consensus of hierarchical groups. More specifically, a hierarchical group consists of a collection of agents with double-integrator dynamics on a directed acyclic graph with additional reverse edges, which characterize feedback mechanisms across hierarchical layers. As the group size grows and the reverse edges appear, we investigate whether the absolute velocity protocol and the relative velocity protocol can preserve the system consensus property without tuning the control gains. It is rigorously proved that the absolute velocity protocol is able to achieve completely scalable second-order consensus but the relative velocity protocol cannot. This result theoretically reveals how the scalable coordination behavior in hierarchical groups is determined by local interaction rules. Moreover, we develop a hierarchical structure in order to achieve scalable second-order consensus for networks of any size and with any number of reverse edges.
Submitted 31 March, 2024;
originally announced April 2024.
-
Fluid Antenna for Mobile Edge Computing
Authors:
Yiping Zuo,
Jiajia Guo,
Biyun Sheng,
Chen Dai,
Fu Xiao,
Shi Jin
Abstract:
In the evolving environment of mobile edge computing (MEC), optimizing system performance to meet the growing demand for low-latency computing services is a top priority. Integrating fluidic antenna (FA) technology into MEC networks provides a new approach to address this challenge. This letter proposes an FA-enabled MEC scheme that aims to minimize the total system delay by leveraging the mobility of FA to enhance channel conditions and improve computational offloading efficiency. By establishing an optimization problem focusing on the joint optimization of computation offloading and antenna positioning, we introduce an alternating iterative algorithm based on the interior point method and particle swarm optimization (IPPSO). Numerical results demonstrate the advantages of our proposed scheme compared to traditional fixed antenna positions, showing significant improvements in transmission rates and reductions in delays. The proposed IPPSO algorithm exhibits robust convergence properties, further validating the effectiveness of our method.
Submitted 18 March, 2024;
originally announced March 2024.
-
MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections
Authors:
Mude Hui,
Zihao Wei,
Hongru Zhu,
Fei Xia,
Yuyin Zhou
Abstract:
Volumetric optical microscopy using non-diffracting beams enables rapid imaging of 3D volumes by projecting them axially to 2D images but lacks crucial depth information. Addressing this, we introduce MicroDiffusion, a pioneering tool facilitating high-quality, depth-resolved 3D volume reconstruction from limited 2D projections. While existing Implicit Neural Representation (INR) models often yield incomplete outputs and Denoising Diffusion Probabilistic Models (DDPM) excel at capturing details, our method integrates INR's structural coherence with DDPM's fine-detail enhancement capabilities. We pretrain an INR model to transform 2D axially-projected images into a preliminary 3D volume. This pretrained INR acts as a global prior guiding DDPM's generative process through a linear interpolation between INR outputs and noise inputs. This strategy enriches the diffusion process with structured 3D information, enhancing detail and reducing noise in localized 2D images. By conditioning the diffusion model on the closest 2D projection, MicroDiffusion substantially enhances fidelity in resulting 3D reconstructions, surpassing INR and standard DDPM outputs with unparalleled image quality and structural fidelity. Our code and dataset are available at https://github.com/UCSC-VLAA/MicroDiffusion.
Submitted 16 March, 2024;
originally announced March 2024.
-
First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation
Authors:
Hejing Zhang,
Qiaoxi Zhu,
Jian Guan,
Haohe Liu,
Feiyang Xiao,
Jiantong Tian,
Xinhao Mei,
Xubo Liu,
Wenwu Wang
Abstract:
First-shot (FS) unsupervised anomalous sound detection (ASD) is a brand-new task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for the target machine types are unseen in training. Existing methods often rely on the availability of normal and abnormal sound data from the target machines. However, due to the lack of anomalous sound data for the target machine types, it becomes challenging when adapting the existing ASD methods to the first-shot task. In this paper, we propose a new framework for the first-shot unsupervised ASD, where metadata-assisted audio generation is used to estimate unknown anomalies, by utilising the available machine information (i.e., metadata and sound data) to fine-tune a text-to-audio generation model for generating the anomalous sounds that contain unique acoustic characteristics accounting for each different machine type. We then use the method of Time-Weighted Frequency domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the backbone to achieve the first-shot unsupervised ASD. Our proposed FS-TWFR-GMM method achieves competitive performance amongst top systems in DCASE 2023 Challenge Task 2, while requiring only 1% model parameters for detection, as validated in our experiments.
Submitted 11 March, 2024; v1 submitted 22 October, 2023;
originally announced October 2023.
-
MERTech: Instrument Playing Technique Detection Using Self-Supervised Pretrained Model With Multi-Task Finetuning
Authors:
Dichucheng Li,
Yinghao Ma,
Weixing Wei,
Qiuqiang Kong,
Yulun Wu,
Mingjin Che,
Fan Xia,
Emmanouil Benetos,
Wei Li
Abstract:
Instrument playing techniques (IPTs) constitute a pivotal component of musical expression. However, the development of automatic IPT detection methods suffers from limited labeled data and inherent class imbalance issues. In this paper, we propose to apply a self-supervised learning model pre-trained on large-scale unlabeled music data and finetune it on IPT detection tasks. This approach addresses data scarcity and class imbalance challenges. Recognizing the significance of pitch in capturing the nuances of IPTs and the importance of onset in locating IPT events, we investigate multi-task finetuning with pitch and onset detection as auxiliary tasks. Additionally, we apply a post-processing approach for event-level prediction, where an IPT activation initiates an event only if the onset output confirms an onset in that frame. Our method outperforms prior approaches in both frame-level and event-level metrics across multiple IPT benchmark datasets. Further experiments demonstrate the efficacy of multi-task finetuning on each IPT class.
Submitted 15 October, 2023;
originally announced October 2023.
-
Transformer-based Autoencoder with ID Constraint for Unsupervised Anomalous Sound Detection
Authors:
Jian Guan,
Youde Liu,
Qiuqiang Kong,
Feiyang Xiao,
Qiaoxi Zhu,
Jiantong Tian,
Wenwu Wang
Abstract:
Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. Autoencoder (AE)-based and self-supervised learning-based methods are the two mainstream approaches. However, AE-based methods can be limited because the features learned from normal sounds may also fit anomalous sounds, reducing the model's ability to detect anomalies. Self-supervised methods are not always stable and can perform differently even for machines of the same type. In addition, the anomalous sound may be short-lived, making it even harder to distinguish from normal sound. This paper proposes an ID-constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. Machine ID is employed to constrain the latent space of the Transformer-based autoencoder (TransAE) by introducing a simple ID classifier to learn the difference in the distribution for the same machine type and enhance the model's ability to distinguish anomalous sound. Moreover, weighted anomaly score computation is introduced to highlight the anomaly scores of anomalous events that only appear for a short time. Experiments performed on the DCASE 2020 Challenge Task 2 development dataset demonstrate the effectiveness and superiority of our proposed method.
Submitted 13 October, 2023;
originally announced October 2023.
-
Synth-AC: Enhancing Audio Captioning with Synthetic Supervision
Authors:
Feiyang Xiao,
Qiaoxi Zhu,
Jian Guan,
Xubo Liu,
Haohe Liu,
Kejia Zhang,
Wenwu Wang
Abstract:
Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes a SynthAC framework, which leverages recent advances in audio generative models and commonly available text corpus to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, the text-to-audio generation model, i.e., AudioLDM, is used to generate synthetic audio signals with captions from an image captioning dataset. Our SynthAC expands the availability of well-annotated captions from the text-vision domain to audio captioning, thus enhancing text-audio representation by learning relations within synthetic text-audio pairs. Experiments demonstrate that our SynthAC framework can benefit audio captioning models by incorporating well-annotated text corpus from the text-vision domain, offering a promising solution to the challenge caused by data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
Submitted 18 September, 2023;
originally announced September 2023.
-
Anomalous Sound Detection Using Self-Attention-Based Frequency Pattern Analysis of Machine Sounds
Authors:
Hejing Zhang,
Jian Guan,
Qiaoxi Zhu,
Feiyang Xiao,
Youde Liu
Abstract:
Different machines can exhibit diverse frequency patterns in their emitted sound. This feature has recently been explored in anomalous sound detection and has reached state-of-the-art performance. However, existing methods rely on the manual or empirical determination of the frequency filter by observing the effective frequency range in the training data, which may be impractical for general application. This paper proposes an anomalous sound detection method using self-attention-based frequency pattern analysis and spectral-temporal information fusion. Our experiments demonstrate that the self-attention module automatically and adaptively analyses the effective frequencies of a machine sound and enhances that information in the spectral feature representation. With spectral-temporal information fusion, the obtained audio feature eventually improves the anomaly detection performance on the DCASE 2020 Challenge Task 2 dataset.
Submitted 6 September, 2023; v1 submitted 27 August, 2023;
originally announced August 2023.
-
State Estimator Design: Addressing General Delay Structures with Dissipative Constraints
Authors:
Qian Feng,
Feng Xiao,
Xiaoyu Wang
Abstract:
Dissipative estimator (observer) design for continuous time-delay systems poses a significant challenge when an unlimited number of pointwise and general distributed delays (DDs) are concerned. We propose an effective solution to this semi-open problem using the Krasovskiĭ functional (KF) framework in conjunction with a quadratic supply rate function, where both the plant and the estimator can accommodate an unlimited number of pointwise and general distributed delays with an unlimited number of square-integrable kernels. A key contribution is the introduction of a control concept called Kronecker-Seuret Decomposition (KSD) for matrix-valued functions, which allows for the factorizations or approximations of any DD integral kernel without introducing conservatism. Moreover, using KSD facilitates the construction of complete-type KFs with integral kernels that can contain any number of weakly differentiable and linearly independent functions. Our proposed solution is formulated as sequential convex SDP problems and is set out in two theorems along with an off-line iterative algorithm, which eliminates the need for nonlinear numerical solvers. We show the effectiveness of our method using two challenging numerical experiments, including a system stabilized by a non-smooth controller.
Submitted 7 August, 2024; v1 submitted 24 July, 2023;
originally announced July 2023.
-
GAN-based Image Compression with Improved RDO Process
Authors:
Fanxin Xia,
Jian Jin,
Lili Meng,
Feng Ding,
Huaxiang Zhang
Abstract:
GAN-based image compression schemes have shown remarkable progress lately due to their high perceptual quality at low bit rates. However, there are two main issues: 1) perceptual degeneration of the reconstructed image in color, texture, and structure, and 2) an inaccurate entropy model. In this paper, we present a novel GAN-based image compression approach with an improved rate-distortion optimization (RDO) process. To achieve this, we utilize the DISTS and MS-SSIM metrics to measure perceptual degeneration in color, texture, and structure. Besides, we adopt the discretized Gaussian-Laplacian-logistic mixture model (GLLMM) for entropy modeling to improve the accuracy in estimating the probability distributions of the latent representation. During the evaluation process, instead of evaluating the perceptual quality of the reconstructed image via IQA metrics, we directly conduct the Mean Opinion Score (MOS) experiment among different codecs, which fully reflects the actual perceptual results of humans. Experimental results demonstrate that the proposed method outperforms the existing GAN-based methods and the state-of-the-art hybrid codec (i.e., VVC).
Submitted 17 June, 2023;
originally announced June 2023.
-
Conditional Diffusion Models for Semantic 3D Brain MRI Synthesis
Authors:
Zolnamar Dorjsembe,
Hsing-Kuo Pao,
Sodtavilan Odonchimed,
Furen Xiao
Abstract:
Artificial intelligence (AI) in healthcare, especially in medical imaging, faces challenges due to data scarcity and privacy concerns. Addressing these, we introduce Med-DDPM, a diffusion model designed for 3D semantic brain MRI synthesis. This model effectively tackles data scarcity and privacy issues by integrating semantic conditioning. This involves the channel-wise concatenation of a conditioning image to the model input, enabling control in image generation. Med-DDPM demonstrates superior stability and performance compared to existing 3D brain imaging synthesis methods. It generates diverse, anatomically coherent images with high visual fidelity. In terms of dice score accuracy in the tumor segmentation task, Med-DDPM achieves 0.6207, close to the 0.6531 accuracy of real images, and outperforms baseline models. Combined with real images, it further increases segmentation accuracy to 0.6675, showing the potential of our proposed method for data augmentation. This model represents the first use of a diffusion model in 3D semantic brain MRI synthesis, producing high-quality images. Its semantic conditioning feature also shows potential for image anonymization in biomedical imaging, addressing data and privacy issues. We provide the code and model weights for Med-DDPM on our GitHub repository (https://github.com/mobaidoctor/med-ddpm/) to support reproducibility.
Submitted 19 April, 2024; v1 submitted 29 May, 2023;
originally announced May 2023.
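The abstract above reports segmentation quality as a Dice score. As a reminder of what that number measures, here is a minimal sketch of the standard Dice coefficient for two binary masks. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; this is not taken from the Med-DDPM code.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks.

    Hypothetical helper for illustration; returns 1.0 when both masks
    are empty (an assumed convention).
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0
    return 2.0 * intersection / total

# Toy example: one overlapping voxel out of two foreground voxels per mask.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.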
-
Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining
Authors:
Jian Guan,
Feiyang Xiao,
Youde Liu,
Qiaoxi Zhu,
Wenwu Wang
Abstract:
Existing contrastive learning methods for anomalous sound detection refine the audio representation of each audio sample by using the contrast between the samples' augmentations (e.g., with time or frequency masking). However, they might be biased by the augmented data, due to the lack of physical properties of machine sound, thereby limiting the detection performance. This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample. The proposed two-stage method uses contrastive learning to pretrain the audio representation model by incorporating machine ID and a self-supervised ID classifier to fine-tune the learnt model, while enhancing the relation between audio features from the same ID. Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification in overall anomaly detection performance and stability on the DCASE 2020 Challenge Task 2 dataset.
Submitted 10 April, 2023; v1 submitted 7 April, 2023;
originally announced April 2023.
-
Graph Attention for Automated Audio Captioning
Authors:
Feiyang Xiao,
Jian Guan,
Qiaoxi Zhu,
Wenwu Wang
Abstract:
State-of-the-art audio captioning methods typically use the encoder-decoder structure with pretrained audio neural networks (PANNs) as encoders for feature extraction. However, the convolution operation used in PANNs is limited in capturing the long-time dependencies within an audio signal, thereby leading to potential performance degradation in audio captioning. This letter presents a novel method using graph attention (GraphAC) for encoder-decoder based audio captioning. In the encoder, a graph attention module is introduced after the PANNs to learn contextual association (i.e. the dependency among the audio features over different time frames) through an adjacency graph, and a top-k mask is used to mitigate the interference from noisy nodes. The learnt contextual association leads to a more effective feature representation with feature node aggregation. As a result, the decoder can predict important semantic information about the acoustic scene and events based on the contextual associations learned from the audio signal. Experimental results show that GraphAC outperforms the state-of-the-art methods with PANNs as the encoders, thanks to the incorporation of the graph attention module into the encoder for capturing the long-time dependencies within the audio signal. The source code is available at https://github.com/LittleFlyingSheep/GraphAC.
Submitted 10 April, 2023; v1 submitted 7 April, 2023;
originally announced April 2023.
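The GraphAC abstract above mentions a top-k mask that suppresses noisy nodes in the adjacency graph. The sketch below shows one common way such a mask can be applied to a matrix of attention scores: keep the k largest scores in each row, then normalize with a softmax. The function name `top_k_attention` and the row-wise softmax are assumptions for illustration, and the authors' implementation may differ.

```python
import numpy as np

def top_k_attention(scores, k):
    """Keep the k largest scores per row, then apply a row-wise softmax.

    Hypothetical sketch: `scores` is an (n, n) array of pairwise attention
    scores between time-frame nodes, and 1 <= k <= n is assumed.
    """
    scores = np.asarray(scores, dtype=float)
    # Column indices of the k largest entries in each row.
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    # Masked-out entries become -inf so they receive zero weight.
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy example: three nodes, each attending to its two strongest neighbours.
weights = top_k_attention([[1.0, 2.0, 3.0],
                           [3.0, 1.0, 2.0],
                           [0.0, 5.0, 4.0]], k=2)
```

Each row of `weights` sums to one and has exactly k non-zero entries, so each node aggregates features only from its k most strongly associated frames.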
-
Frame-Level Multi-Label Playing Technique Detection Using Multi-Scale Network and Self-Attention Mechanism
Authors:
Dichucheng Li,
Mingjin Che,
Wenwu Meng,
Yulun Wu,
Yi Yu,
Fan Xia,
Wei Li
Abstract:
Instrument playing technique (IPT) is a key element of musical presentation. However, most of the existing works for IPT detection only concern monophonic music signals, yet little has been done to detect IPTs in polyphonic instrumental solo pieces with overlapping IPTs or mixed IPTs. In this paper, we formulate it as a frame-level multi-label classification problem and apply it to Guzheng, a Chinese plucked string instrument. We create a new dataset, Guzheng_Tech99, containing Guzheng recordings and onset, offset, pitch, and IPT annotations of each note. Because different IPTs vary a lot in their lengths, we propose a new method to solve this problem using a multi-scale network and self-attention. The multi-scale network extracts features from different scales, and the self-attention mechanism applied to the feature maps at the coarsest scale further enhances the long-range feature extraction. Our approach outperforms existing works by a large margin, indicating its effectiveness in IPT detection.
Submitted 23 March, 2023;
originally announced March 2023.
-
Advanced Multi-Microscopic Views Cell Semi-supervised Segmentation
Authors:
Fang Hu,
Xuexue Sun,
Ke Qing,
Fenxi Xiao,
Zhi Wang,
Xiaolu Fan
Abstract:
Although deep learning (DL) shows powerful potential in cell segmentation tasks, it suffers from poor generalization, as DL-based methods originally reduced cell segmentation to detecting the cell membrane boundary, without prominent cellular structures to guide overall differentiation. Moreover, the scarcity of annotated cell images limits the performance of DL models. Being limited to segmenting a single category of cell makes large-scale practice difficult, let alone with varied modalities. In this paper, we introduce a novel semi-supervised cell segmentation method called Multi-Microscopic-view Cell semi-supervised Segmentation (MMCS), which can train cell segmentation models well using fewer labeled multi-posture cell images from different microscopy modalities. Technically, MMCS consists of Nucleus-assisted global recognition, Self-adaptive diameter filter, and Temporal-ensembling models. Nucleus-assisted global recognition adds a cell nucleus channel to improve the global distinguishing performance of fuzzy cell membrane boundaries even when cells aggregate. Besides, the self-adaptive cell diameter filter helps separate multi-resolution cells with different morphology properly. It further leverages the temporal-ensembling models to improve the semi-supervised training process, achieving effective training with less labeled data. Additionally, optimizing the weight that the unlabeled loss contributes to the total loss also improves model performance. Evaluated on the Tuning Set of NeurIPS 2022 Cell Segmentation Challenge (NeurIPS CellSeg), MMCS achieves an F1-score of 0.8239 and the running time for all cases is within the time tolerance.
Submitted 21 March, 2023;
originally announced March 2023.
-
Unsupervised Domain Adaptation for Automated Knee Osteoarthritis Phenotype Classification
Authors:
Junru Zhong,
Yongcheng Yao,
Donal G. Cahill,
Fan Xiao,
Siyue Li,
Jack Lee,
Kevin Ki-Wai Ho,
Michael Tim-Yun Ong,
James F. Griffith,
Weitian Chen
Abstract:
Purpose: The aim of this study was to demonstrate the utility of unsupervised domain adaptation (UDA) in automated knee osteoarthritis (OA) phenotype classification using a small dataset (n=50). Materials and Methods: For this retrospective study, we collected 3,166 three-dimensional (3D) double-echo steady-state magnetic resonance (MR) images from the Osteoarthritis Initiative dataset and 50 3D turbo/fast spin-echo MR images from our institute (in 2020 and 2021) as the source and target datasets, respectively. For each patient, the degree of knee OA was initially graded according to the MRI Osteoarthritis Knee Score (MOAKS) before being converted to binary OA phenotype labels. The proposed UDA pipeline included (a) pre-processing, which involved automatic segmentation and region-of-interest cropping; (b) source classifier training, which involved pre-training phenotype classifiers on the source dataset; (c) target encoder adaptation, which involved unsupervised adaptation of the source encoder to the target encoder; and (d) target classifier validation, which involved statistical analysis of the target classification performance evaluated by the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity and accuracy. Additionally, a classifier was trained without UDA for comparison. Results: The target classifier trained with UDA achieved improved AUROC, sensitivity, specificity and accuracy for both knee OA phenotypes compared with the classifier trained without UDA. Conclusion: The proposed UDA approach improves the performance of automated knee OA phenotype classification for small target datasets by utilising a large, high-quality source dataset for training. The results successfully demonstrated the advantages of the UDA approach in classification on small datasets.
Submitted 13 December, 2022;
originally announced December 2022.
-
Fully Automated Deep Learning-enabled Detection for Hepatic Steatosis on Computed Tomography: A Multicenter International Validation Study
Authors:
Zhongyi Zhang,
Guixia Li,
Ziqiang Wang,
Feng Xia,
Ning Zhao,
Huibin Nie,
Zezhong Ye,
Joshua Lin,
Yiyi Hui,
Xiangchun Liu
Abstract:
Despite the high global prevalence of hepatic steatosis, no automated diagnostics have demonstrated generalizability in detecting steatosis on multiple international datasets. Traditionally, hepatic steatosis detection relies on clinicians selecting the region of interest (ROI) on computed tomography (CT) to measure liver attenuation. ROI selection demands time and expertise, and therefore is not routinely performed in populations. To automate the process, we validated an existing artificial intelligence (AI) system for 3D liver segmentation and used it to propose a novel method: AI-ROI, which could automatically select the ROI for attenuation measurements. AI segmentation and AI-ROI method were evaluated on 1,014 non-contrast enhanced chest CT images from eight international datasets: LIDC-IDRI, NSCLC-Lung1, RIDER, VESSEL12, RICORD-1A, RICORD-1B, COVID-19-Italy, and COVID-19-China. AI segmentation achieved a mean dice coefficient of 0.957. Attenuations measured by AI-ROI showed no significant differences (p = 0.545) and a 71% reduction in time compared to expert measurements. The area under the curve (AUC) of the steatosis classification of AI-ROI is 0.921 (95% CI: 0.883 - 0.959). If performed as a routine screening method, our AI protocol could potentially allow early non-invasive, non-pharmacological preventative interventions for hepatic steatosis. 1,014 expert-annotated liver segmentations of patients with hepatic steatosis annotations can be downloaded here: https://drive.google.com/drive/folders/1-g_zJeAaZXYXGqL1OeF6pUjr6KB0igJX.
Submitted 6 November, 2022; v1 submitted 26 October, 2022;
originally announced October 2022.
-
Robotic Table Wiping via Reinforcement Learning and Whole-body Trajectory Optimization
Authors:
Thomas Lew,
Sumeet Singh,
Mario Prats,
Jeffrey Bingham,
Jonathan Weisz,
Benjie Holson,
Xiaohan Zhang,
Vikas Sindhwani,
Yao Lu,
Fei Xia,
Peng Xu,
Tingnan Zhang,
Jie Tan,
Montserrat Gonzalez
Abstract:
We propose a framework to enable multipurpose assistive mobile robots to autonomously wipe tables to clean spills and crumbs. This problem is challenging, as it requires planning wiping actions while reasoning over uncertain latent dynamics of crumbs and spills captured via high-dimensional visual observations. Simultaneously, we must guarantee constraints satisfaction to enable safe deployment in unstructured cluttered environments. To tackle this problem, we first propose a stochastic differential equation to model crumbs and spill dynamics and absorption with a robot wiper. Using this model, we train a vision-based policy for planning wiping actions in simulation using reinforcement learning (RL). To enable zero-shot sim-to-real deployment, we dovetail the RL policy with a whole-body trajectory optimization framework to compute base and arm joint trajectories that execute the desired wiping motions while guaranteeing constraints satisfaction. We extensively validate our approach in simulation and on hardware. Video: https://youtu.be/inORKP4F3EI
Submitted 19 October, 2022;
originally announced October 2022.
-
Deep-OCTA: Ensemble Deep Learning Approaches for Diabetic Retinopathy Analysis on OCTA Images
Authors:
Junlin Hou,
Fan Xiao,
Jilan Xu,
Yuejie Zhang,
Haidong Zou,
Rui Feng
Abstract:
The ultra-wide optical coherence tomography angiography (OCTA) has become an important imaging modality in diabetic retinopathy (DR) diagnosis. However, few studies focus on automatic DR analysis using ultra-wide OCTA. In this paper, we present novel and practical deep-learning solutions based on ultra-wide OCTA for the Diabetic Retinopathy Analysis Challenge (DRAC). In the segmentation of DR lesions task, we utilize UNet and UNet++ to segment three lesions with strong data augmentation and model ensemble. In the image quality assessment task, we create an ensemble of InceptionV3, SE-ResNeXt, and Vision Transformer models. Pre-training on the large dataset as well as the hybrid MixUp and CutMix strategy are both adopted to boost the generalization ability of our model. In the DR grading task, we build a Vision Transformer (ViT) and find that the ViT model pre-trained on color fundus images serves as a useful substrate for OCTA images. Our proposed methods ranked 4th, 3rd, and 5th on the three leaderboards of DRAC, respectively. The source code will be made available at https://github.com/FDU-VTS/DRAC.
Submitted 2 October, 2022;
originally announced October 2022.
-
Playing Technique Detection by Fusing Note Onset Information in Guzheng Performance
Authors:
Dichucheng Li,
Yulun Wu,
Qinyu Li,
Jiahao Zhao,
Yi Yu,
Fan Xia,
Wei Li
Abstract:
The Guzheng is a traditional Chinese instrument with diverse playing techniques. Instrument playing techniques (IPT) play an important role in musical performance. However, most of the existing works for IPT detection show low efficiency for variable-length audio and provide no assurance of generalization as they rely on a single sound bank for training and testing. In this study, we propose an end-to-end Guzheng playing technique detection system using Fully Convolutional Networks that can be applied to variable-length audio. Because each Guzheng playing technique is applied to a note, a dedicated onset detector is trained to divide an audio recording into several notes and its predictions are fused with frame-wise IPT predictions. During fusion, we add the IPT predictions frame by frame inside each note and get the IPT with the highest probability within each note as the final output of that note. We create a new dataset named GZ_IsoTech from multiple sound banks and real-world recordings for Guzheng performance analysis. Our approach achieves 87.97% in frame-level accuracy and 80.76% in note-level F1-score, outperforming existing works by a large margin, which indicates the effectiveness of our proposed method in IPT detection.
Submitted 19 September, 2022;
originally announced September 2022.
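The fusion step described in the abstract above (sum the frame-wise IPT predictions inside each note, then take the technique with the highest total) can be sketched as follows. The function name `fuse_note_level`, the array layout, and the representation of onsets as sorted frame indices are assumptions for illustration; frames before the first onset are simply ignored here, which may not match the authors' system.

```python
import numpy as np

def fuse_note_level(frame_probs, onsets):
    """Assign one technique label per note from frame-wise predictions.

    Hypothetical sketch: `frame_probs` has shape (n_frames, n_techniques)
    and `onsets` is a sorted list of frame indices where notes begin.
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    n_frames = frame_probs.shape[0]
    # Each note spans from its onset to the next onset (or the last frame).
    bounds = list(onsets) + [n_frames]
    labels = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        summed = frame_probs[start:end].sum(axis=0)  # add predictions frame by frame
        labels.append(int(np.argmax(summed)))        # technique with the highest total
    return labels

# Toy example: five frames, two techniques, notes starting at frames 0 and 2.
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9], [0.45, 0.55]]
note_labels = fuse_note_level(probs, onsets=[0, 2])
```

In this toy input the first note sums to (1.5, 0.5) and the second to (0.75, 2.25), so the notes are labelled with techniques 0 and 1 respectively.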
-
Physics-based neural network for non-invasive control of coherent light in scattering media
Authors:
Alexandra d'Arco,
Fei Xia,
Antoine Boniface,
Jonathan Dong,
Sylvain Gigan
Abstract:
Optical imaging through complex media, such as biological tissues or fog, is challenging due to light scattering. In the multiple scattering regime, wavefront shaping provides an effective method to retrieve information; it relies on measuring how the propagation of different optical wavefronts is impacted by scattering. Based on this principle, several wavefront shaping techniques were successfully developed, but most of them are highly invasive and limited to proof-of-principle experiments. Here, we propose to use a neural network approach to non-invasively characterize and control light scattering inside the medium and also to retrieve information about hidden objects buried within it. Unlike most of the recently-proposed approaches, the architecture of our neural network with its layers, connected nodes and activation functions has a true physical meaning as it mimics the propagation of light in our optical system. It is trained with an experimentally-measured input/output dataset built from a series of incident light patterns and corresponding camera snapshots. We apply our physics-based neural network to a fluorescence microscope in epi-configuration and demonstrate its performance through numerical simulations and experiments. This flexible method can include physical priors and we show that it can be applied to other systems as, for example, non-linear or coherent contrast mechanisms.
Submitted 1 June, 2022;
originally announced June 2022.
-
Denoising of Three-Dimensional Fast Spin Echo Magnetic Resonance Images of Knee Joints using Spatial-Variant Noise-Relevant Residual Learning of Convolution Neural Network
Authors:
Shutian Zhao,
Donal G. Cahill,
Siyue Li,
Fan Xiao,
Thierry Blu,
James F Griffith,
Weitian Chen
Abstract:
Two-dimensional (2D) fast spin echo (FSE) techniques play a central role in the clinical magnetic resonance imaging (MRI) of knee joints. Moreover, three-dimensional (3D) FSE provides high-isotropic-resolution magnetic resonance (MR) images of knee joints, but it has a reduced signal-to-noise ratio compared to 2D FSE. Deep-learning denoising methods are a promising approach for denoising MR images, but they are often trained using synthetic noise due to challenges in obtaining true noise distributions for MR images. In this study, inherent true noise information from 2-NEX acquisition was used to develop a deep-learning model based on residual learning of convolutional neural network (CNN), and this model was used to suppress the noise in 3D FSE MR images of knee joints. The proposed CNN used two-step residual learning over parallel transporting and residual blocks and was designed to comprehensively learn real noise features from 2-NEX training data. The results of an ablation study validated the network design. The new method achieved improved denoising performance of 3D FSE knee MR images compared with current state-of-the-art methods, based on the peak signal-to-noise ratio and structural similarity index measure. The improved image quality after denoising using the new method was verified by radiological evaluation. A deep CNN using the inherent spatial-varying noise information in 2-NEX acquisitions was developed. This method showed promise for clinical MRI assessments of the knee, and has potential applications for the assessment of other anatomical structures.
Submitted 20 April, 2022;
originally announced April 2022.
-
Local Information Assisted Attention-free Decoder for Audio Captioning
Authors:
Feiyang Xiao,
Jian Guan,
Haiyan Lan,
Qiaoxi Zhu,
Wenwu Wang
Abstract:
Automated audio captioning aims to describe audio data with captions using natural language. Existing methods often employ an encoder-decoder structure, where the attention-based decoder (e.g., Transformer decoder) is widely used and achieves state-of-the-art performance. Although this method effectively captures global information within audio data via the self-attention mechanism, it may miss events of short duration because of its limited ability to capture local information in an audio signal, leading to inaccurate caption predictions. To address this issue, we propose a method using pretrained audio neural networks (PANNs) as the encoder and a local information assisted attention-free Transformer (LocalAFT) as the decoder. The novelty of our method lies in the LocalAFT decoder, which allows local information within an audio signal to be captured while retaining the global information. This enables events of different durations, including short ones, to be captured for more precise caption generation. Experiments show that our method outperforms the state-of-the-art methods in Task 6 of the DCASE 2021 Challenge with the standard attention-based decoder for caption generation.
Submitted 3 July, 2022; v1 submitted 10 January, 2022;
originally announced January 2022.
-
Distributed strategy-updating rules for aggregative games of multi-integrator systems with coupled constraints
Authors:
Xin Cai,
Feng Xiao,
Bo Wei
Abstract:
In this paper, we explore aggregative games over networks of multi-integrator agents with coupled constraints. To reach the general Nash equilibrium of an aggregative game, a distributed strategy-updating rule is proposed that combines the coordination of Lagrange multipliers with the estimation of the aggregator. Each player has access only to partial-decision information and communicates with its neighbors over a weight-balanced digraph, which characterizes players' preferences as to the values of information received from neighbors. We first consider networks of double-integrator agents and then focus on multi-integrator agents. The effectiveness of the proposed strategy-updating rules is demonstrated by analyzing the convergence of the corresponding dynamical systems via Lyapunov stability theory, singular perturbation theory, and passivity theory. Numerical examples are given to illustrate our results.
Submitted 20 June, 2021;
originally announced June 2021.