-
Sensing Cardiac Health Across Scenarios and Devices: A Multi-Modal Foundation Model Pretrained on Heterogeneous Data from 1.7 Million Individuals
Authors:
Xiao Gu,
Wei Tang,
Jinpei Han,
Veer Sangha,
Fenglin Liu,
Shreyank N Gowda,
Antonio H. Ribeiro,
Patrick Schwab,
Kim Branson,
Lei Clifton,
Antonio Luiz P. Ribeiro,
Zhangdaihong Liu,
David A. Clifton
Abstract:
Cardiac biosignals, such as electrocardiograms (ECG) and photoplethysmograms (PPG), are of paramount importance for the diagnosis, prevention, and management of cardiovascular diseases, and have been extensively used in a variety of clinical tasks. Conventional deep learning approaches for analyzing these signals typically rely on homogeneous datasets and static bespoke models, limiting their robustness and generalizability across diverse clinical settings and acquisition protocols. In this study, we present a cardiac sensing foundation model (CSFM) that leverages advanced transformer architectures and a generative, masked pretraining strategy to learn unified representations from vast, heterogeneous health records. Our model is pretrained on an innovative multi-modal integration of data from multiple large-scale datasets (including MIMIC-III-WDB, MIMIC-IV-ECG, and CODE), comprising cardiac signals and the corresponding clinical or machine-generated text reports from approximately 1.7 million individuals. We demonstrate that the embeddings derived from our CSFM not only serve as effective feature extractors across diverse cardiac sensing scenarios, but also enable seamless transfer learning across varying input configurations and sensor modalities. Extensive evaluations across diagnostic tasks, demographic information recognition, vital sign measurement, clinical outcome prediction, and ECG question answering reveal that CSFM consistently outperforms traditional one-modal-one-task approaches. Notably, CSFM exhibits robust performance across multiple ECG lead configurations, from standard 12-lead systems to single-lead setups, and in scenarios where only ECG, only PPG, or a combination thereof is available. These findings highlight the potential of CSFM as a versatile and scalable solution for comprehensive cardiac monitoring.
Submitted 23 June, 2025;
originally announced July 2025.
-
Efficient and Generalizable Speaker Diarization via Structured Pruning of Self-Supervised Models
Authors:
Jiangyu Han,
Petr Pálka,
Marc Delcroix,
Federico Landini,
Johan Rohdin,
Jan Cernocký,
Lukáš Burget
Abstract:
Self-supervised learning (SSL) models such as WavLM have brought substantial improvements to speaker diarization by providing rich contextual representations. However, the high computational and memory costs of these models hinder their deployment in real-time and resource-constrained scenarios. In this work, we present a comprehensive study on compressing SSL-based diarization models through structured pruning guided by knowledge distillation. Building upon our previous work, we extend the analysis to include pruning objectives based on multiply-accumulate operations (MACs), investigate module-wise and progressive pruning strategies, and examine the impact of training data quantity. Experimental results show that our method reduces model size by up to 80% without degrading performance, achieving up to 4x faster inference on a single GPU. We further perform large-scale evaluations on a diverse compound dataset comprising eight public diarization corpora, where our best pruned model achieves state-of-the-art performance across most conditions. Additionally, we show strong generalization to the CHiME-6 dataset, attaining performance comparable to the third-place system in the CHiME-7 challenge without any domain adaptation. All models and code are publicly released to support reproducibility and future research.
Submitted 23 June, 2025;
originally announced June 2025.
-
BUT System for the MLC-SLM Challenge
Authors:
Alexander Polok,
Jiangyu Han,
Dominik Klement,
Samuele Cornell,
Jan Černocký,
Lukáš Burget
Abstract:
We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW -- a diarization-conditioned variant of Whisper -- with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out-of-domain (OOD) multilingual scenarios without any fine-tuning. In this scenario, DiariZen consistently outperforms the baseline Pyannote diarization model, demonstrating strong generalization. Despite being fine-tuned on English-only data for target-speaker ASR, DiCoW retains solid multilingual performance, indicating that encoder modifications preserve Whisper's multilingual capabilities. We then fine-tune both DiCoW and DiariZen on the MLC-SLM challenge data. The fine-tuned DiariZen continues to outperform the fine-tuned Pyannote baseline, while DiCoW sees further gains from domain adaptation. Our final system achieves a micro-average tcpWER/CER of 16.75% and ranks second in Task 2 of the MLC-SLM challenge. Lastly, we identify several labeling inconsistencies in the training data -- such as missing speech segments and incorrect silence annotations -- which can hinder diarization fine-tuning. We propose simple mitigation strategies to address these issues and improve system robustness.
Submitted 16 June, 2025;
originally announced June 2025.
-
crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023
Authors:
Navodini Wijethilake,
Reuben Dorent,
Marina Ivory,
Aaron Kujawa,
Stefan Cornelissen,
Patrick Langenhuizen,
Mohamed Okasha,
Anna Oviedova,
Hexin Dong,
Bogyeong Kang,
Guillaume Sallé,
Luyi Han,
Ziyuan Zhao,
Han Liu,
Tao Yang,
Shahad Hardan,
Hussain Alasmawi,
Santosh Sanjeev,
Yuzhou Zhuang,
Satoshi Kondo,
Maria Baldeon Calisto,
Shaikh Muhammad Uzair Noman,
Cancan Chen,
Ipek Oguz,
Rongguo Zhang
, et al. (14 additional authors not shown)
Abstract:
The cross-Modality Domain Adaptation (crossMoDA) challenge series, initiated in 2021 in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), focuses on unsupervised cross-modality segmentation, learning from contrast-enhanced T1 (ceT1) and transferring to T2 MRI. The task is an extreme example of domain shift chosen to serve as a meaningful and illustrative benchmark. From a clinical application perspective, it aims to automate Vestibular Schwannoma (VS) and cochlea segmentation on T2 scans for more cost-effective VS management. Over time, the challenge objectives have evolved to enhance its clinical relevance. The challenge evolved from using single-institutional data and basic segmentation in 2021 to incorporating multi-institutional data and Koos grading in 2022, and by 2023, it included heterogeneous routine data and sub-segmentation of intra- and extra-meatal tumour components. In this work, we report the findings of the 2022 and 2023 editions and perform a retrospective analysis of the challenge progression over the years. The observations from the successive challenge contributions indicate that the number of outliers decreases with an expanding dataset. This is notable since the diversity of scanning protocols of the datasets concurrently increased. The winning approach of the 2023 edition reduced the number of outliers on the 2021 and 2022 testing data, demonstrating how increased data heterogeneity can enhance segmentation performance even on homogeneous data. However, the cochlea Dice score declined in 2023, likely due to the added complexity from tumour sub-annotations affecting overall segmentation performance. While progress is still needed for clinically acceptable VS segmentation, the plateauing performance suggests that a more challenging cross-modal task may better serve future benchmarking.
Submitted 24 June, 2025; v1 submitted 13 June, 2025;
originally announced June 2025.
-
Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization
Authors:
Jiangyu Han,
Federico Landini,
Johan Rohdin,
Anna Silnova,
Mireia Diez,
Jan Cernocky,
Lukas Burget
Abstract:
Self-supervised learning (SSL) models like WavLM can be effectively utilized when building speaker diarization systems but are often large and slow, limiting their use in resource-constrained scenarios. Previous studies have explored compression techniques, but usually at the price of degraded performance at high pruning ratios. In this work, we propose to compress SSL models through structured pruning by introducing knowledge distillation. Unlike existing work, we emphasize the importance of fine-tuning SSL models before pruning. Experiments on far-field single-channel AMI, AISHELL-4, and AliMeeting datasets show that our method can remove redundant parameters of WavLM Base+ and WavLM Large by up to 80% without any performance degradation. After pruning, the inference speeds on a single GPU for the Base+ and Large models are 4.0 and 2.6 times faster, respectively. Our source code is publicly available.
Submitted 29 May, 2025;
originally announced May 2025.
-
Decoding Speaker-Normalized Pitch from EEG for Mandarin Perception
Authors:
Jiaxin Chen,
Yiming Wang,
Ziyu Zhang,
Jiayang Han,
Yin-Long Liu,
Rui Feng,
Xiuyuan Liang,
Zhen-Hua Ling,
Jiahong Yuan
Abstract:
The same speech content produced by different speakers exhibits significant differences in pitch contour, yet listeners' semantic perception remains unaffected. This phenomenon may stem from the brain's perception of pitch contours being independent of individual speakers' pitch ranges. In this work, we recorded electroencephalogram (EEG) while participants listened to Mandarin monosyllables with varying tones, phonemes, and speakers. The CE-ViViT model is proposed to decode raw or speaker-normalized pitch contours directly from EEG. Experimental results demonstrate that the proposed model can decode pitch contours with modest errors, achieving performance comparable to state-of-the-art EEG regression methods. Moreover, speaker-normalized pitch contours were decoded more accurately, supporting the neural encoding of relative pitch.
Submitted 26 May, 2025;
originally announced May 2025.
-
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Authors:
Kai Li,
Can Shen,
Yile Liu,
Jirui Han,
Kelong Zheng,
Xuechao Zou,
Zhe Wang,
Xingjian Du,
Shun Zhang,
Hanjun Luo,
Yingbin Jin,
Xinxin Xing,
Ziyang Ma,
Yue Liu,
Xiaojun Jia,
Yifan Zhang,
Junfeng Fang,
Kun Wang,
Yibo Yan,
Haoyang Li,
Yiming Li,
Xiaobin Zhuang,
Yang Liu,
Haibo Hu,
Zhizheng Wu
, et al. (6 additional authors not shown)
Abstract:
The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, the systematic evaluation of these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
Submitted 1 July, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
Analysis of ABC Frontend Audio Systems for the NIST-SRE24
Authors:
Sara Barahona,
Anna Silnova,
Ladislav Mošner,
Junyi Peng,
Oldřich Plchot,
Johan Rohdin,
Lin Zhang,
Jiangyu Han,
Petr Palka,
Federico Landini,
Lukáš Burget,
Themos Stafylakis,
Sandro Cumani,
Dominik Boboš,
Miroslav Hlavaček,
Martin Kodovsky,
Tomáš Pavlíček
Abstract:
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-the-art frontends for speaker recognition.
Submitted 21 May, 2025;
originally announced May 2025.
-
Parameter Convergence Detector Based on VAMP Deep Unfolding: A Novel Radar Constant False Alarm Rate Detection Algorithm
Authors:
Haoyun Zhang,
Jianghong Han,
Xueqian Wang,
Gang Li,
Xiao-Ping Zhang
Abstract:
The sub-Nyquist radar framework exploits the sparsity of signals, which effectively alleviates the pressure on system storage and transmission bandwidth. Compressed sensing (CS) algorithms, such as the VAMP algorithm, are used for sparse signal processing in the sub-Nyquist radar framework. By combining deep unfolding techniques with VAMP, faster convergence and higher accuracy than traditional CS algorithms are achieved. However, deep unfolding disrupts the parameter constraints of the traditional VAMP algorithm, so the distribution of the non-sparse noisy estimation in VAMP deep unfolding is unknown and its distribution parameter cannot be obtained directly with the method used in traditional VAMP, which prevents the application of VAMP deep unfolding in radar constant false alarm rate (CFAR) detection. To address this problem, we explore the distribution of the non-sparse noisy estimation and propose a parameter convergence detector (PCD) to achieve CFAR detection based on VAMP deep unfolding. Compared to the state-of-the-art methods, PCD leverages not only the sparse solution, but also the non-sparse noisy estimation, which is used to iteratively estimate the distribution parameter and serves as the test statistic in the detection process. In this way, the proposed algorithm takes advantage of both the enhanced sparse recovery accuracy from deep unfolding and the distribution property of VAMP, thereby achieving superior CFAR detection performance. Additionally, the PCD requires no information about the power of AWGN in the environment, which makes it more suitable for practical application. The convergence performance and effectiveness of the proposed PCD are analyzed based on the Banach Fixed-Point Theorem. Numerical simulations and practical data experiments demonstrate that PCD can achieve better false alarm control and target detection performance.
Submitted 14 April, 2025;
originally announced April 2025.
-
SELIC: Semantic-Enhanced Learned Image Compression via High-Level Textual Guidance
Authors:
Haisheng Fu,
Jie Liang,
Zhenman Fang,
Jingning Han
Abstract:
Learned image compression (LIC) techniques have achieved remarkable progress; however, effectively integrating high-level semantic information remains challenging. In this work, we present a Semantic-Enhanced Learned Image Compression framework, termed SELIC, which leverages high-level textual guidance to improve rate-distortion performance. Specifically, SELIC employs a text encoder to extract rich semantic descriptions from the input image. These textual features are transformed into fixed-dimension tensors and seamlessly fused with the image-derived latent representation. By embedding the SELIC tensor directly into the compression pipeline, our approach enriches the bitstream without requiring additional inputs at the decoder, thereby maintaining fast and efficient decoding. Extensive experiments on benchmark datasets (e.g., Kodak) demonstrate that integrating semantic information substantially enhances compression quality. Our SELIC-guided method outperforms a baseline LIC model without semantic integration by approximately 0.1-0.15 dB across a wide range of bit rates in PSNR and achieves a 4.9% BD-rate improvement over VVC. Moreover, this improvement comes with minimal computational overhead, making the proposed SELIC framework a practical solution for advanced image compression applications.
Submitted 1 April, 2025;
originally announced April 2025.
-
Vision-to-Music Generation: A Survey
Authors:
Zhaokai Wang,
Chenxi Bao,
Le Zhuo,
Jingrui Han,
Yang Yue,
Yihong Tang,
Victor Shea-Jay Huang,
Yue Liao
Abstract:
Vision-to-music Generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion on vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types of symbolic music and audio music. We then summarize the existing methodologies on vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow latest works and foster further innovation in this field, we are continuously maintaining a GitHub repository at https://github.com/wzk1015/Awesome-Vision-to-Music-Generation.
Submitted 27 March, 2025;
originally announced March 2025.
-
A CGAN-LSTM-Based Framework for Time-Varying Non-Stationary Channel Modeling
Authors:
Keying Guo,
Ruisi He,
Mi Yang,
Yuxin Zhang,
Bo Ai,
Haoxiang Zhang,
Jiahui Han,
Ruifeng Chen
Abstract:
Time-varying non-stationary channels, with complex dynamic variations and temporal evolution characteristics, pose significant challenges in channel modeling and communication system performance evaluation. Most existing methods of time-varying channel modeling focus on predicting the channel state at a given moment or simulating short-term channel fluctuations, and are unable to capture the long-term evolution of the channel. This paper emphasizes the generation of long-term dynamic channels to fully capture the evolution of non-stationary channel properties. The generated channel not only reflects temporal dynamics but also ensures consistent stationarity. We propose a hybrid deep learning framework that combines conditional generative adversarial networks (CGAN) with long short-term memory (LSTM) networks. A stationarity-constrained approach is designed to ensure temporal correlation of the generated time-series channel. This method can generate channels with the required temporal non-stationarity. The model is validated by comparing channel statistical features, and the results show that the generated channel is in good agreement with the raw channel and provides good performance in terms of non-stationarity.
Submitted 2 March, 2025;
originally announced March 2025.
-
$\mathbf{\Phi}$-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data
Authors:
Xidan Zhang,
Yihan Zhuang,
Qian Guo,
Haodong Yang,
Xuelin Qian,
Gong Cheng,
Junwei Han,
Zhongling Huang
Abstract:
Approaches for improving generative adversarial networks (GANs) training under a few samples have been explored for natural images. However, these methods have limited effectiveness for synthetic aperture radar (SAR) images, as they do not account for the unique electromagnetic scattering properties of SAR. To remedy this, we propose a physics-inspired regularization method dubbed $Φ$-GAN, which incorporates the ideal point scattering center (PSC) model of SAR with two physical consistency losses. The PSC model approximates SAR targets using physical parameters, ensuring that $Φ$-GAN generates SAR images consistent with real physical properties while preventing discriminator overfitting by focusing on PSC-based decision cues. To embed the PSC model into GANs for end-to-end training, we introduce a physics-inspired neural module capable of estimating the physical parameters of SAR targets efficiently. This module retains the interpretability of the physical model and can be trained with limited data. We propose two physical loss functions: one for the generator, guiding it to produce SAR images with physical parameters consistent with real ones, and one for the discriminator, enhancing its robustness by basing decisions on PSC attributes. We evaluate $Φ$-GAN across several conditional GAN (cGAN) models, demonstrating state-of-the-art performance in data-scarce scenarios on three SAR image datasets.
Submitted 3 March, 2025;
originally announced March 2025.
-
Transient Stability Analysis and Fault Clearing Angle Estimation of VSG Based on Domain of Attraction Estimated by Trajectory Reversing Method
Authors:
Jiayue Lyu,
Tianzhi Fang,
Zhiheng Lin,
Jingxue Han,
Yantao Zhu
Abstract:
The virtual synchronous generator (VSG), with the analogous nonlinear power-angle relationship to the synchronous generator (SG), has attracted much attention as a promising solution for converter-based power systems. In this paper, a large signal model of the grid-connected VSG is first established. The trajectory reversing method (TRM) is then introduced to estimate the domain of attraction (DOA) of VSG. Subsequently, the transient instability mechanism is revealed in detail based on the estimated DOA boundary. The impacts of system parameters on the DOA range are further investigated. It is found that loss of synchronization (LOS) occurs if the system trajectory lies outside the post-fault DOA range. In scenarios where no equilibrium points exist after a grid fault, system stability can be reestablished only when the fault clearing angle (FCA) does not exceed the critical clearing angle (CCA). Finally, the CCA derived from the DOA and that from the conventional equal area criteria (EAC) are compared. The results show that CCA obtained by our solution has a higher accuracy. Time-domain simulations are performed to verify the effectiveness of the proposed transient stability analysis method of grid-connected VSG.
Submitted 26 February, 2025;
originally announced February 2025.
-
ENACT-Heart -- ENsemble-based Assessment Using CNN and Transformer on Heart Sounds
Authors:
Jiho Han,
Adnan Shaout
Abstract:
This study explores the application of Vision Transformer (ViT) principles in audio analysis, specifically focusing on heart sounds. This paper introduces ENACT-Heart - a novel ensemble approach that leverages the complementary strengths of Convolutional Neural Networks (CNN) and ViT through a Mixture of Experts (MoE) framework, achieving a remarkable classification accuracy of 97.52%. This outperforms the individual contributions of ViT (93.88%) and CNN (95.45%), demonstrating the potential for enhanced diagnostic accuracy in cardiovascular health monitoring. These results demonstrate the potential of ensemble methods in enhancing classification performance for cardiovascular health monitoring and diagnosis.
Submitted 24 February, 2025;
originally announced February 2025.
-
Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders
Authors:
Seungbae Kim,
Daeun Lee,
Brielle Stark,
Jinyoung Han
Abstract:
Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, there has been little attention on integrating non-verbal communication methods, such as gestures, which individuals with language disorders substantially rely on to supplement their communication. Recognizing the need to interpret the latent meanings of visual information not captured by speech alone, we propose a gesture-aware ASR system utilizing a multimodal large language model with zero-shot learning for individuals with speech impairments. Our experiment results and analyses show that including gesture information significantly enhances semantic understanding. This study can help develop effective communication technologies, specifically designed to meet the unique needs of individuals with language impairments.
Submitted 18 February, 2025;
originally announced February 2025.
-
Uplink Coordinated Pilot Design for 1-bit Massive MIMO in Correlated Channel
Authors:
Hyeongtak Yun,
Juntaek Han,
Kaiming Shen,
Jeonghun Park
Abstract:
In this paper, we propose a coordinated pilot design method to minimize the channel estimation mean squared error (MSE) in massive multiple-input multiple-output (MIMO) systems with 1-bit analog-to-digital converters (ADCs). Under the assumption that the well-known Bussgang linear minimum mean square error (BLMMSE) estimator is used for channel estimation, we first observe that the resulting MSE leads to an intractable optimization problem, as it involves the arcsin function and a complex multiple matrix ratio form. To resolve this, we derive an approximate MSE by assuming the low signal-to-noise ratio (SNR) regime, from which we develop an efficient coordinated pilot design based on a fractional programming technique. The proposed pilot design differs from existing work in that it is applicable in general system environments, including correlated channel and multi-cell environments. We demonstrate that the proposed method achieves better channel estimation accuracy than conventional approaches.
Submitted 19 February, 2025;
originally announced February 2025.
-
DC-VSR: Spatially and Temporally Consistent Video Super-Resolution with Video Diffusion Prior
Authors:
Janghyeok Han,
Gyujin Sim,
Geonung Kim,
Hyun-seung Lee,
Kyuha Choi,
Youngseok Han,
Sunghyun Cho
Abstract:
Video super-resolution (VSR) aims to reconstruct a high-resolution (HR) video from a low-resolution (LR) counterpart. Achieving successful VSR requires producing realistic HR details and ensuring both spatial and temporal consistency. To restore realistic details, diffusion-based VSR approaches have recently been proposed. However, the inherent randomness of diffusion, combined with their tile-based approach, often leads to spatio-temporal inconsistencies. In this paper, we propose DC-VSR, a novel VSR approach to produce spatially and temporally consistent VSR results with realistic textures. To achieve spatial and temporal consistency, DC-VSR adopts a novel Spatial Attention Propagation (SAP) scheme and a Temporal Attention Propagation (TAP) scheme that propagate information across spatio-temporal tiles based on the self-attention mechanism. To enhance high-frequency details, we also introduce Detail-Suppression Self-Attention Guidance (DSSAG), a novel diffusion guidance scheme. Comprehensive experiments demonstrate that DC-VSR achieves spatially and temporally consistent, high-quality VSR results, outperforming previous approaches.
Submitted 26 May, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
Rethinking the Upsampling Layer in Hyperspectral Image Super Resolution
Authors:
Haohan Shi,
Fei Zhou,
Xin Sun,
Jungong Han
Abstract:
Deep learning has achieved significant success in single hyperspectral image super-resolution (SHSR); however, the high spectral dimensionality leads to a heavy computational burden, making it difficult to deploy in real-time scenarios. To address this issue, this paper proposes a novel lightweight SHSR network, LKCA-Net, that incorporates channel attention to calibrate multi-scale channel features of hyperspectral images. Furthermore, we demonstrate, for the first time, that the low-rank property of the learnable upsampling layer is a key bottleneck in lightweight SHSR methods. To address this, we employ a low-rank approximation strategy to reduce the parameter redundancy of the learnable upsampling layer. Additionally, we introduce a knowledge distillation-based feature alignment technique to ensure the low-rank approximated network retains the same feature representation capacity as the original. We conducted extensive experiments on the Chikusei, Houston 2018, and Pavia Center datasets and compared our method against state-of-the-art (SOTA) approaches. The results demonstrate that our method is competitive in performance while achieving speedups of several dozen to even hundreds of times compared to other well-performing SHSR methods.
Submitted 30 January, 2025;
originally announced January 2025.
-
Unsupervised Patch-GAN with Targeted Patch Ranking for Fine-Grained Novelty Detection in Medical Imaging
Authors:
Jingkun Chen,
Guang Yang,
Xiao Zhang,
Jingchao Peng,
Tianlu Zhang,
Jianguo Zhang,
Jungong Han,
Vicente Grau
Abstract:
Detecting novel anomalies in medical imaging is challenging due to the limited availability of labeled data for rare abnormalities, which often display high variability and subtlety. This challenge is further compounded when small abnormal regions are embedded within larger normal areas, as whole-image predictions frequently overlook these subtle deviations. To address these issues, we propose an unsupervised Patch-GAN framework designed to detect and localize anomalies by capturing both local detail and global structure. Our framework first reconstructs masked images to learn fine-grained, normal-specific features, allowing for enhanced sensitivity to minor deviations from normality. By dividing these reconstructed images into patches and assessing the authenticity of each patch, our approach identifies anomalies at a more granular level, overcoming the limitations of whole-image evaluation. Additionally, a patch-ranking mechanism prioritizes regions with higher abnormal scores, reinforcing the alignment between local patch discrepancies and the global image context. Experimental results on the ISIC 2016 skin lesion and BraTS 2019 brain tumor datasets validate our framework's effectiveness, achieving AUCs of 95.79% and 96.05%, respectively, and outperforming three state-of-the-art baselines.
Submitted 29 January, 2025;
originally announced January 2025.
-
Radiologist-in-the-Loop Self-Training for Generalizable CT Metal Artifact Reduction
Authors:
Chenglong Ma,
Zilong Li,
Yuanlin Li,
Jing Han,
Junping Zhang,
Yi Zhang,
Jiannan Liu,
Hongming Shan
Abstract:
Metal artifacts in computed tomography (CT) images can significantly degrade image quality and impede accurate diagnosis. Supervised metal artifact reduction (MAR) methods, trained using simulated datasets, often struggle to perform well on real clinical CT images due to a substantial domain gap. Although state-of-the-art semi-supervised methods use pseudo ground-truths generated by a prior network to mitigate this issue, their reliance on a fixed prior limits both the quality and quantity of these pseudo ground-truths, introducing confirmation bias and reducing clinical applicability. To address these limitations, we propose a novel Radiologist-In-the-loop SElf-training framework for MAR, termed RISE-MAR, which can integrate radiologists' feedback into the semi-supervised learning process, progressively improving the quality and quantity of pseudo ground-truths for enhanced generalization on real clinical CT images. For quality assurance, we introduce a clinical quality assessor model that emulates radiologist evaluations, effectively selecting high-quality pseudo ground-truths for semi-supervised training. For quantity assurance, our self-training framework iteratively generates additional high-quality pseudo ground-truths, expanding the clinical dataset and further improving model generalization. Extensive experimental results on multiple clinical datasets demonstrate the superior generalization performance of our RISE-MAR over state-of-the-art methods, advancing the development of MAR models for practical application. Code is available at https://github.com/Masaaki-75/rise-mar.
Submitted 26 January, 2025;
originally announced January 2025.
-
Processing and Analyzing Real-World Driving Data: Insights on Trips, Scenarios, and Human Driving Behaviors
Authors:
Jihun Han,
Dominik Karbowski,
Ayman Moawad,
Namdoo Kim,
Aymeric Rousseau,
Shihong Fan,
Jason Hoon Lee,
Jinho Ha
Abstract:
Analyzing large volumes of real-world driving data is essential for providing meaningful and reliable insights into real-world trips, scenarios, and human driving behaviors. To this end, we developed a multi-level data processing approach that adds new information, segments data, and extracts desired parameters. Leveraging a confidential but extensive dataset (over 1 million km), this approach leads to three levels of in-depth analysis: trip, scenario, and driving. The trip-level analysis explains representative properties observed in real-world trips, while the scenario-level analysis focuses on scenario conditions resulting from road events that reduce vehicle speed. The driving-level analysis identifies the cause of driving regimes for specific situations and characterizes typical human driving behaviors. Such analyses can support the design of both trip- and scenario-based tests, the modeling of human drivers, and the establishment of guidelines for connected and automated vehicles.
Submitted 15 January, 2025;
originally announced January 2025.
-
Reducing Latency by Eliminating CSIT Feedback: FDD Downlink MIMO Precoding Without CSIT Feedback for Internet-of-Things Communications
Authors:
Juntaek Han,
Namhyun Kim,
Jeonghun Park
Abstract:
This paper presents a novel framework for low-latency frequency division duplex (FDD) multi-input multi-output (MIMO) transmission with Internet of Things (IoT) communications. Our key idea is eliminating feedback associated with downlink channel state information at the transmitter (CSIT) acquisition. Instead, we propose to reconstruct downlink CSIT from uplink reference signals by exploiting the frequency invariance property on channel parameters. Nonetheless, the frequency disparity between the uplink and downlink makes it impossible to get perfect downlink CSIT, resulting in substantial interference. To address this, we formulate a max-min fairness problem and propose a rate-splitting multiple access (RSMA)-aided efficient precoding method. In particular, to fully harness the potential benefits of RSMA, we propose a method that approximates the error covariance matrix and incorporates it into the precoder optimization process. This approach effectively accounts for the impact of imperfect CSIT, enabling the design of a robust precoder that efficiently handles CSIT inaccuracies. Simulation results demonstrate that our framework outperforms other baseline methods in terms of the minimum spectral efficiency when no direct CSI feedback is used. Moreover, we show that our framework significantly reduces communication latency compared to conventional CSI feedback-based methods, underscoring its effectiveness in enhancing latency performance for IoT communications.
Submitted 13 January, 2025;
originally announced January 2025.
-
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
Authors:
Alexander Polok,
Dominik Klement,
Martin Kocour,
Jiangyu Han,
Federico Landini,
Bolaji Yusuf,
Matthew Wiesner,
Sanjeev Khudanpur,
Jan Černocký,
Lukáš Burget
Abstract:
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head to Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model's target-speaker ASR capabilities while maintaining Whisper's accuracy and robustness on single-speaker data.
Submitted 30 December, 2024;
originally announced January 2025.
-
Zero-resource Speech Translation and Recognition with LLMs
Authors:
Karel Mundnich,
Xing Niu,
Prashant Mathur,
Srikanth Ronanki,
Brady Houston,
Veera Raghavendra Elluru,
Nilaksh Das,
Zejiang Hou,
Goeric Huybrechts,
Anshu Bhatia,
Daniel Garcia-Romero,
Kyu J. Han,
Katrin Kirchhoff
Abstract:
Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how best to train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model achieves BLEU scores over 23 on CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. Finally, we show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
Submitted 30 December, 2024; v1 submitted 24 December, 2024;
originally announced December 2024.
-
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
Authors:
Jiatong Shi,
Hye-jin Shim,
Jinchuan Tian,
Siddhant Arora,
Haibin Wu,
Darius Petermann,
Jia Qi Yip,
You Zhang,
Yuxun Tang,
Wangyou Zhang,
Dareen Safar Alharthi,
Yichen Huang,
Koichi Saito,
Jionghao Han,
Yiwen Zhao,
Chris Donahue,
Shinji Watanabe
Abstract:
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA supports the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/wavlab-speech/versa.
Submitted 26 March, 2025; v1 submitted 23 December, 2024;
originally announced December 2024.
-
Integrated Sensing and Communications in Downlink FDD MIMO without CSI Feedback
Authors:
Namhyun Kim,
Juntaek Han,
Jinseok Choi,
Ahmed Alkhateeb,
Chan-Byoung Chae,
Jeonghun Park
Abstract:
In this paper, we propose a precoding framework for frequency division duplex (FDD) integrated sensing and communication (ISAC) systems with multiple-input multiple-output (MIMO). Specifically, we aim to maximize ergodic sum spectral efficiency (SE) while satisfying a sensing beam pattern constraint defined by the mean squared error (MSE). Our method reconstructs downlink (DL) channel state information (CSI) from uplink (UL) training signals using partial reciprocity, eliminating the need for CSI feedback. To obtain the error covariance matrix of the reconstructed DL CSI, we devise an observed Fisher information-based estimation technique. Leveraging this, to mitigate interference caused by imperfect DL CSI reconstruction and sensing operations, we propose a rate-splitting multiple access (RSMA) aided precoder optimization method. This method jointly updates the precoding vector and Lagrange multipliers by solving the nonlinear eigenvalue problem with eigenvector dependency to maximize SE. The numerical results show that the proposed design achieves precise beam pattern control, maximizes SE, and significantly improves the sensing-communication trade-off compared to the state-of-the-art methods in FDD ISAC scenarios.
Submitted 10 June, 2025; v1 submitted 17 December, 2024;
originally announced December 2024.
-
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
Authors:
Baisen Wang,
Le Zhuo,
Zhaokai Wang,
Chenxi Bao,
Wu Chengjing,
Xuecheng Nie,
Jiao Dai,
Jizhong Han,
Yue Liao,
Si Liu
Abstract:
Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their application in multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to generate music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks, along with experiments on controllability. The results demonstrate that VMB significantly enhances music quality, modality, and customization alignment compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation with applications in various multimedia fields. Demos and code are available at https://github.com/wbs2788/VMB.
Submitted 12 December, 2024;
originally announced December 2024.
-
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Authors:
Kaixiong Gong,
Kaituo Feng,
Bohao Li,
Yibing Wang,
Mofan Cheng,
Shijia Yang,
Jiaming Han,
Benyou Wang,
Yutong Bai,
Zhuoran Yang,
Xiangyu Yue
Abstract:
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we have structured the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize the observations. By revealing the limitations of current models, we aim to provide useful insight for future dataset collection and model development.
Submitted 3 December, 2024;
originally announced December 2024.
-
Energy Efficient Automated Driving as a GNEP: Vehicle-in-the-loop Experiments
Authors:
Viranjan Bhattacharyya,
Tyler Ard,
Rongyao Wang,
Ardalan Vahidi,
Yunyi Jia,
Jihun Han
Abstract:
In this paper, a multi-agent motion planning problem is studied aiming to minimize energy consumption of connected automated vehicles (CAVs) in lane change scenarios. We model this interactive motion planning as a generalized Nash equilibrium problem and formalize how vehicle-to-vehicle intention sharing enables solution of the game between multiple CAVs as an optimal control problem for each agent, to arrive at a generalized Nash equilibrium. The method is implemented via model predictive control (MPC) and compared with an advanced baseline MPC which utilizes unilateral predictions of other agents' future states. A ROS-based in-the-loop testbed is developed: the method is first evaluated in software-in-the-loop and then vehicle-in-the-loop experiments are conducted. Experimental results demonstrate energy and travel time benefits of the presented method in interactive lane change maneuvers.
Submitted 21 November, 2024;
originally announced November 2024.
-
Re-Parameterization of Lightweight Transformer for On-Device Speech Emotion Recognition
Authors:
Zixing Zhang,
Zhongren Dong,
Weixiang Xu,
Jing Han
Abstract:
With the increasing implementation of machine learning models on edge or Internet-of-Things (IoT) devices, deploying advanced models on resource-constrained IoT devices remains challenging. Transformer models, a currently dominant neural architecture, have achieved great success in broad domains, but their complexity hinders their deployment on IoT devices with limited computation capability and storage size. Although many model compression approaches have been explored, they often suffer from notorious performance degradation. To address this issue, we introduce a new method, namely Transformer Re-parameterization, to boost the performance of lightweight Transformer models. It consists of two processes: the High-Rank Factorization (HRF) process in the training stage and the deHigh-Rank Factorization (deHRF) process in the inference stage. In the former process, we insert an additional linear layer before the Feed-Forward Network (FFN) of the lightweight Transformer. The inserted HRF layers are expected to enhance the model's learning capability. In the latter process, the auxiliary HRF layer is merged with the following FFN layer into one linear layer, thus recovering the original structure of the lightweight model. To examine the effectiveness of the proposed method, we evaluate it on three widely used Transformer variants, i.e., ConvTransformer, Conformer, and SpeechFormer networks, in the application of speech emotion recognition on the IEMOCAP, M3ED, and DAIC-WOZ datasets. Experimental results show that our proposed method consistently improves the performance of lightweight Transformers, even making them comparable to large models. The proposed re-parameterization approach enables advanced Transformer models to be deployed on resource-constrained IoT devices.
Submitted 14 November, 2024;
originally announced November 2024.
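The merge step described in the preceding abstract (folding an auxiliary linear layer into the linear layer that follows it) rests on a simple identity: with no nonlinearity in between, W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). The sketch below is only an illustration of that identity in plain Python, not the authors' implementation; the function name `merge_linear`, the layer sizes, and the assumption that both layers carry biases are hypothetical choices made for the example.

```python
import random

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]

def merge_linear(w1, b1, w2, b2):
    """Fold y = w2 @ (w1 @ x + b1) + b2 into a single layer y = w @ x + b.

    Assumes there is no activation between the two layers (hypothetical setup).
    """
    w = matmul(w2, w1)
    b = [u + v for u, v in zip(matvec(w2, b1), b2)]
    return w, b

random.seed(0)
d_in, d_mid, d_out = 4, 12, 6  # illustrative sizes, not taken from the paper

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

w1, b1 = rand_matrix(d_mid, d_in), [random.gauss(0.0, 1.0) for _ in range(d_mid)]
w2, b2 = rand_matrix(d_out, d_mid), [random.gauss(0.0, 1.0) for _ in range(d_out)]
x = [random.gauss(0.0, 1.0) for _ in range(d_in)]

# Training-time structure: two stacked linear layers.
hidden = [h + c for h, c in zip(matvec(w1, x), b1)]
two_layer = [y + c for y, c in zip(matvec(w2, hidden), b2)]

# Inference-time structure: one merged linear layer.
w, b = merge_linear(w1, b1, w2, b2)
one_layer = [y + c for y, c in zip(matvec(w, x), b)]

max_abs_diff = max(abs(p - q) for p, q in zip(two_layer, one_layer))
```

The merged layer has the input and output sizes of the original pair, so the wider intermediate dimension used during training adds no parameters at inference time; `max_abs_diff` should differ from zero only by floating-point rounding.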
-
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Authors:
Chien-yu Huang,
Wei-Chih Chen,
Shu-wen Yang,
Andy T. Liu,
Chen-An Li,
Yu-Xiang Lin,
Wei-Cheng Tseng,
Anuj Diwan,
Yi-Jen Shih,
Jiatong Shi,
William Chen,
Chih-Kai Yang,
Wenze Ren,
Xuanjun Chen,
Chi-Yuan Hsiao,
Puyuan Peng,
Shih-Heng Wang,
Chun-Yi Kuan,
Ke-Han Lu,
Kai-Wei Chang,
Fabian Ritter-Gutierrez,
Kuan-Po Huang,
Siddhant Arora,
You-Kuan Lin,
Ming To Chuang,
et al. (55 additional authors not shown)
Abstract:
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results show that no model performed well universally. SALMONN-13B excelled in English ASR and Qwen2-Audio-7B-Instruct showed high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We open-source all task data and the evaluation pipeline at https://github.com/dynamic-superb/dynamic-superb.
Submitted 9 June, 2025; v1 submitted 8 November, 2024;
originally announced November 2024.
-
Generative Artificial Intelligence Meets Synthetic Aperture Radar: A Survey
Authors:
Zhongling Huang,
Xidan Zhang,
Zuqian Tang,
Feng Xu,
Mihai Datcu,
Junwei Han
Abstract:
SAR images possess unique attributes that present challenges for both human observers and vision AI models to interpret, owing to their electromagnetic characteristics. The interpretation of SAR images encounters various hurdles, with one of the primary obstacles being the data itself, which includes issues related to both the quantity and quality of the data. The challenges can be addressed using generative AI technologies. Generative AI, often known as GenAI, is a very advanced and powerful technology in the field of artificial intelligence that has gained significant attention. The advancement has created possibilities for the creation of texts, photorealistic pictures, videos, and material in various modalities. This paper aims to comprehensively investigate the intersection of GenAI and SAR. First, we illustrate the common data generation-based applications in the SAR field and compare them with computer vision tasks, analyzing their similarities, differences, and general challenges. Then, the latest GenAI models are systematically reviewed, including various basic models and their variations targeting the general challenges. Additionally, the corresponding applications in the SAR domain are also included. Specifically, we propose to summarize the physical model based simulation approaches for SAR, and analyze the hybrid modeling methods that combine GenAI and interpretable models. The evaluation methods that have been or could be applied to SAR are also explored. Finally, the potential challenges and future prospects are discussed. To the best of our knowledge, this survey is the first exhaustive examination of the interdiscipline of SAR and GenAI, encompassing a wide range of topics, including deep neural networks, physical models, computer vision, and SAR images. The resources of this survey are open-source at https://github.com/XAI4SAR/GenAIxSAR.
Submitted 4 November, 2024;
originally announced November 2024.
-
Graph Neural Networks Uncover Geometric Neural Representations in Reinforcement-Based Motor Learning
Authors:
Federico Nardi,
Jinpei Han,
Shlomi Haar,
A. Aldo Faisal
Abstract:
Graph Neural Networks (GNN) can capture the geometric properties of neural representations in EEG data. Here we utilise those to study how reinforcement-based motor learning affects neural activity patterns during motor planning, leveraging the inherent graph structure of EEG channels to capture the spatial relationships in brain activity. By exploiting task-specific symmetries, we define different pretraining strategies that not only improve model performance across all participant groups but also validate the robustness of the geometric representations. Explainability analysis based on the graph structures reveals consistent group-specific neural signatures that persist across pretraining conditions, suggesting stable geometric structures in the neural representations associated with motor learning and feedback processing. These geometric patterns exhibit partial invariance to certain task space transformations, indicating symmetries that enable generalisation across conditions while maintaining specificity to individual learning strategies. This work demonstrates how GNNs can uncover the effects of previous outcomes on motor planning, in a complex real-world task, providing insights into the geometric principles governing neural representations. Our experimental design bridges the gap between controlled experiments and ecologically valid scenarios, offering new insights into the organisation of neural representations during naturalistic motor learning, which may open avenues for exploring fundamental principles governing brain activity in complex tasks.
Submitted 31 October, 2024;
originally announced October 2024.
-
Low-Power Encoding for PAM-3 DRAM Bus
Authors:
Jonghyeon Nam,
Jaeduk Han,
Hokeun Kim
Abstract:
The 3-level pulse amplitude modulation (PAM-3) signaling is expected to be widely used in memory interfaces for its greater voltage margins compared to PAM-4. To maximize the benefit of PAM-3, we propose three low-power data encoding algorithms: PAM3-DBI, PAM3-MF, and PAM3-SORT. With the DRAM memory traces from the gem5 computer architecture simulator running benchmarks, we evaluate the energy efficiency of our three PAM-3 encoding techniques. The experimental results show the proposed algorithms can reduce termination power for high-speed memory links significantly by 41% to 90% for benchmark programs.
Submitted 16 October, 2024;
originally announced October 2024.
-
Performance of a Threshold-based WDM and ACM for FSO Communication between Mobile Platforms in Maritime Environments
Authors:
Jae-Eun Han,
Sung Sik Nam,
Duck Dong Hwang,
Mohamed-Slim Alouini
Abstract:
In this study, we statistically analyze the performance of a threshold-based multiple optical signal selection scheme (TMOS) for wavelength division multiplexing (WDM) and adaptive coded modulation (ACM) using free space optical (FSO) communication between mobile platforms in maritime environments with fog and 3D pointing errors. Specifically, we derive a new closed-form expression for a composite probability density function (PDF) that is more appropriate for applying various algorithms to FSO systems under the combined effects of fog and pointing errors. We then analyze the outage probability, average spectral efficiency (ASE), and bit error rate (BER) performance of the conventional detection techniques (i.e., heterodyne and intensity modulation/direct detection). The derived analytical results were cross-verified using Monte Carlo simulations. The results show that we can obtain a higher ASE performance by applying TMOS-based WDM and ACM and that the probability of the beam being detected in the photodetector increased at a low signal-to-noise ratio, contrary to conventional performance. Furthermore, it has been confirmed that applying WDM and ACM is suitable, particularly in maritime environments where channel conditions frequently change.
Submitted 14 October, 2024;
originally announced October 2024.
-
Flex3D: Feed-Forward 3D Generation with Flexible Reconstruction Model and Input View Curation
Authors:
Junlin Han,
Jianyuan Wang,
Andrea Vedaldi,
Philip Torr,
Filippos Kokkinos
Abstract:
Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications. Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their ability to capture diverse viewpoints and, even worse, leading to suboptimal generation results if the synthesized views are of poor quality. To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object. Subsequently, a view selection pipeline filters these views based on quality and consistency, ensuring that only the high-quality and reliable views are used for reconstruction. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. FlexRM directly outputs 3D Gaussian points leveraging a tri-plane representation, enabling efficient and detailed 3D generation. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models.
Submitted 1 June, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
-
Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling
Authors:
Yuanchao Li,
Zixing Zhang,
Jing Han,
Peter Bell,
Catherine Lai
Abstract:
The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Frechet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.
Submitted 30 April, 2025; v1 submitted 25 September, 2024;
originally announced September 2024.
-
A Survey of Foundation Models for Music Understanding
Authors:
Wenjun Li,
Ying Cai,
Ziyang Wu,
Wenyi Zhang,
Yifan Chen,
Rundong Qi,
Mengqi Dong,
Peigen Chen,
Xiao Dong,
Fenghao Shi,
Lei Guo,
Junwei Han,
Bao Ge,
Tianming Liu,
Lin Gan,
Tuo Zhang
Abstract:
Music is essential in daily life, fulfilling emotional and entertainment needs, and connecting us personally, socially, and culturally. A better understanding of music can enhance our emotions, cognitive skills, and cultural connections. The rapid advancement of artificial intelligence (AI) has introduced new ways to analyze music, aiming to replicate human understanding of music and provide related services. While the traditional models focused on audio features and simple tasks, the recent development of large language models (LLMs) and foundation models (FMs), which excel in various fields by integrating semantic information and demonstrating strong reasoning abilities, could capture complex musical features and patterns, integrate music with language and incorporate rich musical, emotional and psychological knowledge. Therefore, they have the potential in handling complex music understanding tasks from a semantic perspective, producing outputs closer to human perception. This work, to our best knowledge, is one of the early reviews of the intersection of AI techniques and music understanding. We investigated, analyzed, and tested recent large-scale music foundation models in respect of their music comprehension abilities. We also discussed their limitations and proposed possible future directions, offering insights for researchers in this field.
Submitted 14 September, 2024;
originally announced September 2024.
-
ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration
Authors:
Masao Someki,
Kwanghee Choi,
Siddhant Arora,
William Chen,
Samuele Cornell,
Jionghao Han,
Yifan Peng,
Jiatong Shi,
Vaibhav Srivastav,
Shinji Watanabe
Abstract:
We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, to fine-tune a speech foundation model, ESPnet-EZ, compared to ESPnet, reduces the amount of newly written code by 2.7x and the amount of dependent code by 6.7x while dramatically reducing the Bash script dependencies. The codebase of ESPnet-EZ is publicly available.
Submitted 14 September, 2024;
originally announced September 2024.
-
Leveraging Self-Supervised Learning for Speaker Diarization
Authors:
Jiangyu Han,
Federico Landini,
Johan Rohdin,
Anna Silnova,
Mireia Diez,
Lukas Burget
Abstract:
End-to-end neural diarization has evolved considerably over the past few years, but data scarcity is still a major obstacle for further improvements. Self-supervised learning methods such as WavLM have shown promising performance on several downstream tasks, but their application to speaker diarization is somewhat limited. In this work, we explore using WavLM to alleviate the problem of data scarcity for neural diarization training. We use the same pipeline as Pyannote and improve the local end-to-end neural diarization with WavLM and Conformer. Experiments on far-field AMI, AISHELL-4, and AliMeeting datasets show that our method substantially outperforms the Pyannote baseline and achieves new state-of-the-art results on AMI and AISHELL-4. In addition, by analyzing the system performance under different data quantity scenarios, we show that WavLM representations are much more robust against data scarcity than filterbank features, enabling less data-hungry training strategies. Furthermore, we found that simulated data, usually used to train end-to-end diarization models, does not help when using WavLM in our experiments. We also evaluate our model on the recent CHiME8 NOTSOFAR-1 task, where it achieves better performance than the Pyannote baseline. Our source code is publicly available at https://github.com/BUTSpeechFIT/DiariZen.
Submitted 21 October, 2024; v1 submitted 14 September, 2024;
originally announced September 2024.
-
DreamBeast: Distilling 3D Fantastical Animals with Part-Aware Knowledge Transfer
Authors:
Runjia Li,
Junlin Han,
Luke Melas-Kyriazi,
Chunyi Sun,
Zhaochong An,
Zhongrui Gui,
Shuyang Sun,
Philip Torr,
Tomas Jakab
Abstract:
We present DreamBeast, a novel method based on score distillation sampling (SDS) for generating fantastical 3D animal assets composed of distinct parts. Existing SDS methods often struggle with this generation task due to a limited understanding of part-level semantics in text-to-image diffusion models. While recent diffusion models, such as Stable Diffusion 3, demonstrate a better part-level understanding, they are prohibitively slow and exhibit other common problems associated with single-view diffusion models. DreamBeast overcomes this limitation through a novel part-aware knowledge transfer mechanism. For each generated asset, we efficiently extract part-level knowledge from the Stable Diffusion 3 model into a 3D Part-Affinity implicit representation. This enables us to instantly generate Part-Affinity maps from arbitrary camera views, which we then use to modulate the guidance of a multi-view diffusion model during SDS to create 3D assets of fantastical animals. DreamBeast significantly enhances the quality of generated 3D creatures with user-specified part compositions while reducing computational overhead, as demonstrated by extensive quantitative and qualitative evaluations.
Submitted 12 September, 2024;
originally announced September 2024.
-
Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm
Authors:
Yuning Wu,
Jiatong Shi,
Yifeng Yu,
Yuxun Tang,
Tao Qian,
Yueqian Lin,
Jionghao Han,
Xinyi Bai,
Shinji Watanabe,
Qin Jin
Abstract:
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs and offer significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. The toolkit features automatic music score error detection and correction, as well as a perception auto-evaluation module to imitate human subjective evaluation scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.
Submitted 10 October, 2024; v1 submitted 11 September, 2024;
originally announced September 2024.
-
Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement
Authors:
Xianmin Chen,
Peiliang Huang,
Xiaoxu Feng,
Dingwen Zhang,
Longfei Han,
Junwei Han
Abstract:
Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, lead to limited denoising performance. In contrast, two-stage approaches typically decompose a raw image with color filter arrays (CFA) into a four-channel RGGB format before feeding it into a neural network. However, this strategy overlooks the critical role of demosaicing within the Image Signal Processing (ISP) pipeline, leading to color distortions under varying lighting conditions, especially in low-light scenarios. To address these issues, we design a novel Mamba scanning mechanism, called RAWMamba, to effectively handle raw images with different CFAs. Furthermore, we present a Retinex Decomposition Module (RDM) grounded in Retinex prior, which decouples illumination from reflectance to facilitate more effective denoising and automatic non-linear exposure correction. By bridging demosaicing and denoising, better raw image enhancement is achieved. Experimental evaluations conducted on public datasets SID and MCR demonstrate that our proposed RAWMamba achieves state-of-the-art performance on cross-domain mapping.
Submitted 31 December, 2024; v1 submitted 11 September, 2024;
originally announced September 2024.
-
Convolution Type of Metaplectic Cohen's Distribution Time-Frequency Analysis Theory, Method and Technology
Authors:
Manjun Cui,
Zhichao Zhang,
Jie Han,
Yunjie Chen,
Chunzheng Cao
Abstract:
The conventional Cohen's distribution cannot meet the requirement of high-performance denoising of signals jammed by additive noise under low signal-to-noise ratio conditions; it is therefore necessary to integrate the metaplectic transform for fractional-domain time-frequency analysis of non-stationary signals. In this paper, we blend time-frequency operators and coordinate operator fractionizations to formulate the definition of the metaplectic Wigner distribution, based on which we integrate the generalized metaplectic convolution to address the unified representation issue of the convolution type of metaplectic Cohen's distribution (CMCD), whose special cases and essential properties are also derived. We blend the Wiener filter principle and the fractional domain filter mechanism of the metaplectic transform to design the least-squares adaptive filter method in the metaplectic Wigner distribution domain, giving birth to the least-squares adaptive filter-based CMCD whose kernel function can be adjusted with the input signal automatically to achieve the minimum mean-square error (MSE) denoising in the Wigner distribution domain. We discuss the optimal symplectic matrix selection strategy of the proposed adaptive CMCD by modeling and solving the MSE minimization problem. Examples are also provided to demonstrate that the proposed filtering method outperforms some state-of-the-art methods, including the Wiener filter and fixed kernel function-based or adaptive Cohen's distributions, in noise suppression.
Submitted 8 August, 2024;
originally announced August 2024.
-
Adaptive Cohen's Class Time-Frequency Distribution
Authors:
Manjun Cui,
Zhichao Zhang,
Jie Han,
Yunjie Chen,
Chunzheng Cao
Abstract:
The fixed kernel function-based Cohen's class time-frequency distributions (CCTFDs) allow flexibility in denoising for some specific polluted signals. Due to the limitation of fixed kernel functions, however, from the viewpoint of filtering they fail to automatically adjust the response according to the change of signal to adapt to different signal characteristics. In this letter, we integrate the Wiener filter principle and the time-frequency filtering mechanism of CCTFD to design the least-squares adaptive filter method in the Wigner-Ville distribution (WVD) domain, giving birth to the least-squares adaptive filter-based CCTFD whose kernel function can be adjusted with the input signal automatically to achieve the minimum mean-square error denoising in the WVD domain. Examples are also provided to demonstrate that the proposed adaptive CCTFD outperforms some state-of-the-art methods in noise suppression.
Submitted 8 August, 2024;
originally announced August 2024.
-
X-Fake: Juggling Utility Evaluation and Explanation of Simulated SAR Images
Authors:
Zhongling Huang,
Yihan Zhuang,
Zipei Zhong,
Feng Xu,
Gong Cheng,
Junwei Han
Abstract:
SAR image simulation has attracted much attention due to its great potential to supplement the scarce training data for deep learning algorithms. Consequently, evaluating the quality of the simulated SAR image is crucial for practical applications. The current literature primarily uses image quality assessment techniques for evaluation that rely on human observers' perceptions. However, because of the unique imaging mechanism of SAR, these techniques may produce evaluation results that are not entirely valid. The distribution inconsistency between real and simulated data is the main obstacle that influences the utility of simulated SAR images. To this end, we propose a novel trustworthy utility evaluation framework with a counterfactual explanation for simulated SAR images for the first time, denoted as X-Fake. It unifies a probabilistic evaluator and a causal explainer to achieve a trustworthy utility assessment. We construct the evaluator using a probabilistic Bayesian deep model to learn the posterior distribution, conditioned on real data. Quantitatively, the predicted uncertainty of simulated data can reflect the distribution discrepancy. We build the causal explainer with an introspective variational auto-encoder to generate high-resolution counterfactuals. The latent code of IntroVAE is finally optimized with evaluation indicators and prior information to generate the counterfactual explanation, thus revealing the inauthentic details of simulated data explicitly. The proposed framework is validated on four simulated SAR image datasets obtained from electromagnetic models and generative artificial intelligence approaches. The results demonstrate the proposed X-Fake framework outperforms other IQA methods in terms of utility. Furthermore, the results illustrate that the generated counterfactual explanations are trustworthy, and can further improve the data utility in applications.
Submitted 28 July, 2024;
originally announced July 2024.
-
Multi-dimensional Graph Linear Canonical Transform
Authors:
Na Li,
Zhichao Zhang,
Jie Han,
Yunjie Chen,
Chunzheng Cao
Abstract:
Many multi-dimensional (M-D) graph signals appear in the real world, such as digital images, sensor network measurements and temperature records from weather observation stations. It is a key challenge to design a transform method for processing these M-D graph signals in the linear canonical transform domain. This paper proposes the two-dimensional graph linear canonical transform based on the central discrete dilated Hermite function (2-D CDDHFs-GLCT) and the two-dimensional graph linear canonical transform based on chirp multiplication-chirp convolution-chirp multiplication decomposition (2-D CM-CC-CM-GLCT). We then extend 2-D CDDHFs-GLCT and 2-D CM-CC-CM-GLCT to M-D CDDHFs-GLCT and M-D CM-CC-CM-GLCT. M-D CDDHFs-GLCT and M-D CM-CC-CM-GLCT are compared in terms of computational complexity, additivity and reversibility. Theoretical analysis shows that the computational complexity of the M-D CM-CC-CM-GLCT algorithm is clearly reduced. Simulation results indicate that M-D CM-CC-CM-GLCT achieves comparable additivity to M-D CDDHFs-GLCT, while M-D CM-CC-CM-GLCT exhibits better reversibility. Finally, M-D GLCT is applied to data compression to show its application advantages. The experimental results reflect the superiority of M-D GLCT in the algorithm design and implementation of data compression.
Submitted 10 July, 2024;
originally announced July 2024.
-
Graph Linear Canonical Transform Based on CM-CC-CM Decomposition
Authors:
Na Li,
Zhichao Zhang,
Jie Han,
Yunjie Chen,
Chunzheng Cao
Abstract:
The graph linear canonical transform (GLCT) is presented as an extension of the graph Fourier transform (GFT) and the graph fractional Fourier transform (GFrFT), offering more flexibility as an effective tool for graph signal processing. In this paper, we introduce a GLCT based on chirp multiplication-chirp convolution-chirp multiplication decomposition (CM-CC-CM-GLCT), which is independent of sampling periods and requires no oversampling operation. Various properties and special cases of the CM-CC-CM-GLCT are derived and discussed. We compare the CM-CC-CM-GLCT with the GLCT based on the central discrete dilated Hermite function (CDDHFs-GLCT) in terms of computational complexity, additivity, and reversibility. Theoretical analysis demonstrates that the computational complexity of the CM-CC-CM-GLCT is significantly reduced. Simulation results indicate that the CM-CC-CM-GLCT achieves additivity similar to that of the CDDHFs-GLCT. Notably, the CM-CC-CM-GLCT exhibits better reversibility.
Submitted 8 July, 2024;
originally announced July 2024.
-
Knowledge-driven AI-generated data for accurate and interpretable breast ultrasound diagnoses
Authors:
Haojun Yu,
Youcheng Li,
Nan Zhang,
Zihan Niu,
Xuantong Gong,
Yanwen Luo,
Quanlin Wu,
Wangyan Qin,
Mengyuan Zhou,
Jie Han,
Jia Tao,
Ziwei Zhao,
Di Dai,
Di He,
Dong Wang,
Binghui Tang,
Ling Huo,
Qingli Zhu,
Yong Wang,
Liwei Wang
Abstract:
Data-driven deep learning models have shown great capability in assisting radiologists with breast ultrasound (US) diagnoses. However, their effectiveness is limited by the long-tail distribution of training data, which leads to inaccuracies in rare cases. In this study, we address the long-standing challenge of improving diagnostic model performance on rare cases using long-tailed data. Specifically, we introduce a pipeline, TAILOR, that builds a knowledge-driven generative model to produce tailored synthetic data. The generative model, using 3,749 lesions as source data, can generate millions of breast-US images, especially for error-prone rare cases. The generated data can be further used to build a diagnostic model for accurate and interpretable diagnoses. In the prospective external evaluation, our diagnostic model outperforms the average performance of nine radiologists by 33.5% in specificity at the same sensitivity, and improves their performance by providing predictions with an interpretable decision-making process. Moreover, on ductal carcinoma in situ (DCIS), our diagnostic model outperforms all radiologists by a large margin, with only 34 DCIS lesions in the source data. We believe that TAILOR can potentially be extended to various diseases and imaging modalities.
Submitted 23 July, 2024;
originally announced July 2024.