-
ADAgent: LLM Agent for Alzheimer's Disease Analysis with Collaborative Coordinator
Authors:
Wenlong Hou,
Guangqian Yang,
Ye Du,
Yeung Lau,
Lihao Liu,
Junjun He,
Ling Long,
Shujun Wang
Abstract:
Alzheimer's disease (AD) is a progressive and irreversible neurodegenerative disease. Early and precise diagnosis of AD is crucial for timely intervention and treatment planning to alleviate the progressive neurodegeneration. However, most existing methods rely on single-modality data, which contrasts with the multifaceted approach used by medical experts. While some deep learning approaches process multi-modal data, they are limited to specific tasks with a small set of input modalities and cannot handle arbitrary combinations. This highlights the need for a system that can address diverse AD-related tasks, process multi-modal or missing input, and integrate multiple advanced methods for improved performance. In this paper, we propose ADAgent, the first specialized AI agent for AD analysis, built on a large language model (LLM) to address user queries and support decision-making. ADAgent integrates a reasoning engine, specialized medical tools, and a collaborative outcome coordinator to facilitate multi-modal diagnosis and prognosis tasks in AD. Extensive experiments demonstrate that ADAgent outperforms SOTA methods, achieving significant improvements in accuracy, including a 2.7% increase in multi-modal diagnosis, a 0.7% improvement in multi-modal prognosis, and enhancements in MRI and PET diagnosis tasks.
Submitted 15 June, 2025; v1 submitted 11 June, 2025;
originally announced June 2025.
-
VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect
Authors:
Qingyuan Fei,
Wenjie Hou,
Xuan Hai,
Xin Liu
Abstract:
The rapid advancements in AI voice cloning, fueled by machine learning, have significantly impacted text-to-speech (TTS) and voice conversion (VC) fields. While these developments have led to notable progress, they have also raised concerns about the misuse of AI VC technology, causing economic losses and negative public perceptions. To address this challenge, this study focuses on creating active defense mechanisms against AI VC systems.
We propose a novel active defense method, VocalCrypt, which embeds pseudo-timbre (jamming information) based on SFS into audio segments that are imperceptible to the human ear, thereby forming systematic fragments to prevent voice cloning. This approach protects the voice without compromising its quality. In comparison to existing methods, such as adversarial noise incorporation, VocalCrypt significantly enhances robustness and real-time performance, achieving a 500% increase in generation speed while maintaining interference effectiveness.
Unlike audio watermarking techniques, which focus on post-detection, our method offers preemptive defense, reducing implementation costs and enhancing feasibility. Extensive experiments using the Zhvoice and VCTK Corpus datasets show that our AI-cloned speech defense system performs excellently in automatic speaker verification (ASV) tests while preserving the integrity of the protected audio.
Submitted 14 February, 2025;
originally announced February 2025.
-
RFBoost: Understanding and Boosting Deep WiFi Sensing via Physical Data Augmentation
Authors:
Weiying Hou,
Chenshu Wu
Abstract:
Deep learning shows promising performance in wireless sensing. However, deep wireless sensing (DWS) heavily relies on large datasets. Unfortunately, building comprehensive datasets for DWS is difficult and costly, because wireless data depends on environmental factors and cannot be labeled offline. Despite recent advances in few-shot/cross-domain learning, DWS is still facing data scarcity issues. In this paper, we investigate a distinct perspective of radio data augmentation (RDA) for WiFi sensing and present a data-space solution. Our key insight is that wireless signals inherently exhibit data diversity, contributing more information to be extracted for DWS. We present RFBoost, a simple and effective RDA framework encompassing novel physical data augmentation techniques. We implement RFBoost as a plug-and-play module integrated with existing deep models and evaluate it on multiple datasets. Experimental results demonstrate that RFBoost achieves remarkable average accuracy improvements of 5.4% on existing models without additional data collection or model modifications, and the best-boosted performance outperforms 11 state-of-the-art baseline models without RDA. RFBoost pioneers the study of RDA, an important yet currently underexplored building block for DWS, which we expect to become a standard DWS component of WiFi sensing and beyond. RFBoost is released at https://github.com/aiot-lab/RFBoost.
Submitted 3 October, 2024;
originally announced October 2024.
-
RMFA-Net: A Neural ISP for Real RAW to RGB Image Reconstruction
Authors:
Fei Li,
Wenbo Hou,
Peng Jia
Abstract:
Deep learning-based ISP algorithms have demonstrated significant potential in raw2rgb reconstruction. However, existing networks have not fully considered the specific characteristics of raw data, such as black level and CFA, which can negatively impact texture and color if mishandled. Moreover, uneven exposure in raw data is also not considered carefully, leading to adverse effects on contrast and brightness. In this paper, we introduce RMFA-Net to tackle these problems. We perform implicit black level correction to mitigate color shifts in dim scenes. To preserve high-frequency information and prevent misalignment, we propose a novel Three-Channel-Split mode. To address the issue of uneven exposure, we designed an explicit tone mapping module based on the Retinex theory. We train and evaluate our models using the dataset released by the Mobile AI 2022 Learned Smartphone ISP Challenge. It is demonstrated that RMFA-Net outperforms previous algorithms, achieving a PSNR score of over 25 dB, surpassing the state-of-the-art by +1 dB. Furthermore, we developed a lightweight version, RMFANet-tiny, for engineering deployment while still maintaining strong performance, surpassing the SOTA by +0.5 dB.
Submitted 17 June, 2024;
originally announced June 2024.
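The PSNR gains quoted in the RMFA-Net abstract (+1 dB and +0.5 dB) are easier to interpret next to the standard PSNR definition, 10·log10(MAX² / MSE). The following is a minimal sketch of that definition only; the function names and the 8-bit `max_value` default are assumptions for illustration and are not taken from the paper.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the largest possible pixel value (255 assumes 8-bit images).
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10.0 ** (gain_db / 10.0)
```

Under this definition, a +1 dB gain corresponds to the mean squared error shrinking by a factor of about 1.26 (`mse_ratio_for_gain(1.0)`), and a +0.5 dB gain to a factor of about 1.12.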
-
Boosting Cross-Domain Speech Recognition with Self-Supervision
Authors:
Han Zhu,
Gaofeng Cheng,
Jindong Wang,
Wenxin Hou,
Pengyuan Zhang,
Yonghong Yan
Abstract:
The cross-domain performance of automatic speech recognition (ASR) could be severely hampered due to the mismatch between training and testing distributions. Since the target domain usually lacks labeled data, and domain shifts exist at acoustic and linguistic levels, it is challenging to perform unsupervised domain adaptation (UDA) for ASR. Previous work has shown that self-supervised learning (SSL) or pseudo-labeling (PL) is effective in UDA by exploiting the self-supervisions of unlabeled data. However, these self-supervisions also face performance degradation in mismatched domain distributions, which previous work fails to address. This work presents a systematic UDA framework to fully utilize the unlabeled data with self-supervision in the pre-training and fine-tuning paradigm. On the one hand, we apply continued pre-training and data replay techniques to mitigate the domain mismatch of the SSL pre-trained model. On the other hand, we propose a domain-adaptive fine-tuning approach based on the PL technique with three unique modifications: Firstly, we design a dual-branch PL method to decrease the sensitivity to the erroneous pseudo-labels; Secondly, we devise an uncertainty-aware confidence filtering strategy to improve pseudo-label correctness; Thirdly, we introduce a two-step PL approach to incorporate target domain linguistic knowledge, thus generating more accurate target domain pseudo-labels. Experimental results on various cross-domain scenarios demonstrate that the proposed approach effectively boosts the cross-domain performance and significantly outperforms previous approaches.
Submitted 30 July, 2023; v1 submitted 20 June, 2022;
originally announced June 2022.
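The abstract above mentions filtering pseudo-labels by confidence before fine-tuning. The paper's uncertainty-aware strategy is not described in the listing, so the sketch below substitutes a plain fixed-threshold filter as a stand-in; the tuple layout, the function name, and the 0.9 default are all hypothetical.

```python
def filter_pseudo_labels(predictions, threshold=0.9):
    """Keep only pseudo-labels whose confidence reaches the threshold.

    predictions: iterable of (utterance_id, hypothesis_text, confidence) tuples,
                 with confidence in [0, 1]. This layout is an assumption.
    Returns a list of (utterance_id, hypothesis_text) pairs to train on.
    """
    return [
        (utt_id, hypothesis)
        for utt_id, hypothesis, confidence in predictions
        if confidence >= threshold
    ]


# Hypothetical decoder output on unlabeled target-domain audio.
decoded = [
    ("utt1", "turn on the lights", 0.97),
    ("utt2", "turn of the lights", 0.55),
    ("utt3", "play some music", 0.91),
]
kept = filter_pseudo_labels(decoded, threshold=0.9)
```

Here `kept` holds only `utt1` and `utt3`; the low-confidence hypothesis for `utt2` is dropped so it cannot reinforce a likely transcription error.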
-
Reciprocal phase transition-enabled electro-optic modulation
Authors:
Fang Zou,
Lei Zou,
Ye Tian,
Yiming Zhang,
Erwin Bente,
Weigang Hou,
Yu Liu,
Siming Chen,
Victoria Cao,
Lei Guo,
Songsui Li,
Lianshan Yan,
Wei Pan,
Dusan Milosevic,
Zizheng Cao,
A. M. J. Koonen,
Huiyun Liu,
Xihua Zou
Abstract:
Electro-optic (EO) modulation is a well-known and essential topic in the field of communications and sensing. Its ultrahigh efficiency is unprecedentedly desired in the current green and data era. However, dramatically increasing the modulation efficiency is difficult due to the monotonic mapping relationship between the electrical signal and modulated optical signal. Here, a new mechanism termed phase-transition EO modulation is revealed from the reciprocal transition between two distinct phase planes arising from the bifurcation. Remarkably, a monolithically integrated mode-locked laser (MLL) is implemented as a prototype. A 24.8-GHz radio-frequency signal is generated and modulated, achieving a modulation energy efficiency of 3.06 fJ/bit improved by about four orders of magnitude and a contrast ratio exceeding 50 dB. Thus, MLL-based phase-transition EO modulation is characterised by ultrahigh modulation efficiency and ultrahigh contrast ratio, as experimentally proved in radio-over-fibre and underwater acoustic-sensing systems. This phase-transition EO modulation opens a new avenue for green communication and ubiquitous connections.
Submitted 22 November, 2022; v1 submitted 28 March, 2022;
originally announced March 2022.
-
Exploiting Adapters for Cross-lingual Low-resource Speech Recognition
Authors:
Wenxin Hou,
Han Zhu,
Yidong Wang,
Jindong Wang,
Tao Qin,
Renjun Xu,
Takahiro Shinozaki
Abstract:
Cross-lingual speech adaptation aims to solve the problem of leveraging multiple rich-resource languages to build models for a low-resource target language. Since the low-resource language has limited training data, speech recognition models can easily overfit. In this paper, we investigate the performance of multiple adapters for parameter-efficient cross-lingual speech adaptation. Based on our previous MetaAdapter, which implicitly leverages adapters, we propose a novel algorithm called SimAdapter for explicitly learning knowledge from adapters. Both algorithms rely on adapters, which can be easily integrated into the Transformer structure. MetaAdapter leverages meta-learning to transfer general knowledge from the training data to the test language. SimAdapter aims to learn the similarities between the source and target languages during fine-tuning using the adapters. We conduct extensive experiments on five low-resource languages in the Common Voice dataset. Results demonstrate that our MetaAdapter and SimAdapter methods can reduce WER by 2.98% and 2.55% with only 2.5% and 15.5% of trainable parameters compared to the strong full-model fine-tuning baseline. Moreover, we show that these two novel algorithms can be integrated for better performance, with up to 3.55% relative WER reduction.
Submitted 17 December, 2021; v1 submitted 18 May, 2021;
originally announced May 2021.
-
Cross-domain Speech Recognition with Unsupervised Character-level Distribution Matching
Authors:
Wenxin Hou,
Jindong Wang,
Xu Tan,
Tao Qin,
Takahiro Shinozaki
Abstract:
End-to-end automatic speech recognition (ASR) can achieve promising performance with large-scale training data. However, it is known that domain mismatch between training and testing data often leads to a degradation of recognition accuracy. In this work, we focus on the unsupervised domain adaptation for ASR and propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains. First, to obtain labels for the features belonging to each character, we achieve frame-level label assignment using the Connectionist Temporal Classification (CTC) pseudo labels. Then, we match the character-level distributions using Maximum Mean Discrepancy. We train our algorithm using the self-training technique. Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reduction on both cross-device and cross-environment ASR. We also comprehensively analyze the different strategies for frame-level label assignment and Transformer adaptations.
Submitted 8 June, 2021; v1 submitted 15 April, 2021;
originally announced April 2021.
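The CMatch abstract above matches character-level feature distributions with Maximum Mean Discrepancy (MMD). As a rough illustration, the sketch below computes the standard biased estimate of squared MMD with an RBF kernel, then averages it over the characters present in both domains. The kernel choice, the `gamma` value, the averaging over characters, and all names are assumptions for illustration, not details taken from the paper.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors of equal length."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two non-empty sets of vectors."""

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


def character_level_mmd(source_by_char, target_by_char, gamma=1.0):
    """Average squared MMD over characters that have frames in both domains.

    Each argument maps a character to the list of frame-level feature vectors
    assigned to it (for example via CTC pseudo labels, as in the abstract).
    """
    shared = sorted(set(source_by_char) & set(target_by_char))
    if not shared:
        return 0.0
    per_char = [
        mmd_squared(source_by_char[c], target_by_char[c], gamma) for c in shared
    ]
    return sum(per_char) / len(per_char)
```

The value is zero when the two feature sets for a character coincide and grows as they drift apart, which is what makes it usable as an adaptation loss term.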
-
Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs
Authors:
Sujeong Cha,
Wangrui Hou,
Hyun Jung,
My Phung,
Michael Picheny,
Hong-Kwang Kuo,
Samuel Thomas,
Edmilson Morais
Abstract:
A major focus of recent research in spoken language understanding (SLU) has been on the end-to-end approach where a single model can predict intents directly from speech inputs without intermediate transcripts. However, this approach presents some challenges. First, since speech can be considered as personally identifiable information, in some cases only automatic speech recognition (ASR) transcripts are accessible. Second, intent-labeled speech data is scarce. To address the first challenge, we propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both. We demonstrate strong performance for either modality separately, and when both speech and ASR transcripts are available, through system combination, we achieve better results than using a single input modality. To address the second challenge, we leverage a semantically robust pre-trained BERT model and adopt a cross-modal system that co-trains text embeddings and acoustic embeddings in a shared latent space. We further enhance this system by utilizing an acoustic module pre-trained on LibriSpeech and domain-adapting the text module on our target datasets. Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance on Snips SLU and Fluent Speech Commands datasets.
Submitted 14 June, 2021; v1 submitted 7 April, 2021;
originally announced April 2021.
-
Boosting Semantic Human Matting with Coarse Annotations
Authors:
Jinlin Liu,
Yuan Yao,
Wendi Hou,
Miaomiao Cui,
Xuansong Xie,
Changshui Zhang,
Xian-sheng Hua
Abstract:
Semantic human matting aims to estimate the per-pixel opacity of the foreground human regions. It is quite challenging and usually requires user-interactive trimaps and plenty of high-quality annotated data. Annotating such data is labor intensive and requires skills beyond those of normal users, especially considering the very detailed hair regions of humans. In contrast, coarsely annotated human datasets are much easier to acquire and collect from public datasets. In this paper, we propose to use coarsely annotated data coupled with finely annotated data to boost end-to-end semantic human matting without trimaps as extra input. Specifically, we train a mask prediction network to estimate the coarse semantic mask using the hybrid data, and then propose a quality unification network to unify the quality of the previous coarse mask outputs. A matting refinement network takes in the unified mask and the input image to predict the final alpha matte. The collected coarsely annotated dataset enriches our dataset significantly and allows generating high-quality alpha mattes for real images. Experimental results show that the proposed method performs comparably against state-of-the-art methods. Moreover, the proposed method can be used for refining coarsely annotated public datasets, as well as the outputs of semantic segmentation methods, which reduces the cost of annotating high-quality human data to a great extent.
Submitted 10 April, 2020;
originally announced April 2020.
-
Indoor Localization System of ROS mobile robot based on Visible Light Communication
Authors:
Weipeng Guan,
Shihuan Chen,
Shangsheng Wen,
Wenyuan Hou,
Zequn Tan,
Ruihong Cen
Abstract:
In this paper, an indoor robot localization system based on Robot Operating System (ROS) and visible light communication (VLC) is presented. On the basis of our previous work, we designed a VLC localization and navigation package for ROS, which contains the LED-ID detection and recognition method, the video target tracking algorithm and the double-lamp positioning algorithm. This package exploits the principle of double-lamp positioning and the loose coupling characteristics of ROS, and is implemented as loosely coupled ROS nodes. Consequently, this paper combines ROS and VLC, aiming at promoting the application of VLC positioning in mature robotic systems. Moreover, it pushes forward the development of localization applications based on VLC technology and lays a foundation for porting to other ROS robot platforms. Experimental results show that the proposed system can provide indoor localization within 1 cm and has good real-time performance, taking only 0.4 seconds for one-time positioning. If a high-performance laptop is used, the single positioning time can be reduced to 0.08 seconds. Therefore, this study confirms the practical application and the superior performance of VLC technology in ROS robots, showing the great potential of VLC localization. The video demo of the proposed robot positioning system based on VLC can be seen in *
Submitted 6 January, 2020;
originally announced January 2020.
-
Study on the spectral reconstruction of typical surface types based on spectral library and principal component analysis
Authors:
Weizhen Hou,
Yilan Mao,
Chi Xu,
Zhengqiang Li,
Donghui Li,
Yan Ma,
Hua Xu
Abstract:
To meet the demand for spectral reconstruction in the visible and near-infrared wavelengths, a spectral reconstruction method for typical surface types is discussed based on the USGS/ASTER spectral library and principal component analysis (PCA). A new spectral reconstruction model is proposed that uses the information of several typical bands instead of all wavelength bands, and a linear combination spectral reconstruction model is also discussed. For 4 typical spectral datasets (green vegetation, bare soil, rangeland and concrete) in the spectral range of 400-900 nm, the PCA results show that 6 principal components can characterize the spectral dataset, and the relative reconstruction errors are smaller than 2%. If only 6-7 selected typical bands are employed to reconstruct the full surface reflectance in 400-900 nm, the reconstruction error of green vegetation is about 3.3%, and the relative errors of the other 3 datasets are all smaller than 1.6%. The correlation coefficients of those 4 datasets are all larger than 0.99, which effectively satisfies the needs of spectral reconstruction. In addition, based on the spectral library and the linear combination model of 4 bands commonly used in satellite remote sensing (490, 555, 670 and 865 nm), the reconstruction errors are smaller than 8.5% in the high-reflectance region and smaller than 1.5% in the low-reflectance region, which basically meets the needs of spectral reconstruction. This study can provide a reference for surface reflectance processing and spectral reconstruction in satellite remote sensing research.
Submitted 15 June, 2019;
originally announced June 2019.