Showing 1–50 of 203 results for author: Han, J

Searching in archive eess.
  1. arXiv:2507.01045  [pdf, ps, other]

    cs.LG cs.AI eess.SP

    Sensing Cardiac Health Across Scenarios and Devices: A Multi-Modal Foundation Model Pretrained on Heterogeneous Data from 1.7 Million Individuals

    Authors: Xiao Gu, Wei Tang, Jinpei Han, Veer Sangha, Fenglin Liu, Shreyank N Gowda, Antonio H. Ribeiro, Patrick Schwab, Kim Branson, Lei Clifton, Antonio Luiz P. Ribeiro, Zhangdaihong Liu, David A. Clifton

    Abstract: Cardiac biosignals, such as electrocardiograms (ECG) and photoplethysmograms (PPG), are of paramount importance for the diagnosis, prevention, and management of cardiovascular diseases, and have been extensively used in a variety of clinical tasks. Conventional deep learning approaches for analyzing these signals typically rely on homogeneous datasets and static bespoke models, limiting their robu…

    Submitted 23 June, 2025; originally announced July 2025.

  2. arXiv:2506.18623  [pdf, ps, other]

    eess.AS

    Efficient and Generalizable Speaker Diarization via Structured Pruning of Self-Supervised Models

    Authors: Jiangyu Han, Petr Pálka, Marc Delcroix, Federico Landini, Johan Rohdin, Jan Cernocký, Lukáš Burget

    Abstract: Self-supervised learning (SSL) models such as WavLM have brought substantial improvements to speaker diarization by providing rich contextual representations. However, the high computational and memory costs of these models hinder their deployment in real-time and resource-constrained scenarios. In this work, we present a comprehensive study on compressing SSL-based diarization models through stru…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 11 pages, 6 figures

  3. arXiv:2506.13414  [pdf, ps, other]

    eess.AS

    BUT System for the MLC-SLM Challenge

    Authors: Alexander Polok, Jiangyu Han, Dominik Klement, Samuele Cornell, Jan Černocký, Lukáš Burget

    Abstract: We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW -- a diarization-conditioned variant of Whisper -- with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out-of-domain (OOD) multilingual scenarios without any fine-tuning. In this scenario, DiariZen consistently outperforms the baseline Pyannote diarization model, dem…

    Submitted 16 June, 2025; originally announced June 2025.

  4. arXiv:2506.12006  [pdf, ps, other]

    eess.IV cs.CV

    crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023

    Authors: Navodini Wijethilake, Reuben Dorent, Marina Ivory, Aaron Kujawa, Stefan Cornelissen, Patrick Langenhuizen, Mohamed Okasha, Anna Oviedova, Hexin Dong, Bogyeong Kang, Guillaume Sallé, Luyi Han, Ziyuan Zhao, Han Liu, Tao Yang, Shahad Hardan, Hussain Alasmawi, Santosh Sanjeev, Yuzhou Zhuang, Satoshi Kondo, Maria Baldeon Calisto, Shaikh Muhammad Uzair Noman, Cancan Chen, Ipek Oguz, Rongguo Zhang, et al. (14 additional authors not shown)

    Abstract: The cross-Modality Domain Adaptation (crossMoDA) challenge series, initiated in 2021 in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), focuses on unsupervised cross-modality segmentation, learning from contrast-enhanced T1 (ceT1) and transferring to T2 MRI. The task is an extreme example of domain shift chosen to serve as a mea…

    Submitted 24 June, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

  5. arXiv:2505.24111  [pdf, other]

    eess.AS

    Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, Jan Cernocky, Lukas Burget

    Abstract: Self-supervised learning (SSL) models like WavLM can be effectively utilized when building speaker diarization systems but are often large and slow, limiting their use in resource constrained scenarios. Previous studies have explored compression techniques, but usually for the price of degraded performance at high pruning ratios. In this work, we propose to compress SSL models through structured p…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by INTERSPEECH 2025

  6. arXiv:2505.19626  [pdf, ps, other]

    cs.SD eess.AS

    Decoding Speaker-Normalized Pitch from EEG for Mandarin Perception

    Authors: Jiaxin Chen, Yiming Wang, Ziyu Zhang, Jiayang Han, Yin-Long Liu, Rui Feng, Xiuyuan Liang, Zhen-Hua Ling, Jiahong Yuan

    Abstract: The same speech content produced by different speakers exhibits significant differences in pitch contour, yet listeners' semantic perception remains unaffected. This phenomenon may stem from the brain's perception of pitch contours being independent of individual speakers' pitch ranges. In this work, we recorded electroencephalogram (EEG) while participants listened to Mandarin monosyllables with…

    Submitted 26 May, 2025; originally announced May 2025.

  7. arXiv:2505.16211  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

    Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, et al. (6 additional authors not shown)

    Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safet…

    Submitted 1 July, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: Technical Report

  8. arXiv:2505.15320  [pdf, ps, other]

    eess.AS cs.SD

    Analysis of ABC Frontend Audio Systems for the NIST-SRE24

    Authors: Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini, Lukáš Burget, Themos Stafylakis, Sandro Cumani, Dominik Boboš, Miroslav Hlavaček, Martin Kodovsky, Tomáš Pavlíček

    Abstract: We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the p…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  9. arXiv:2504.09912  [pdf]

    eess.SP

    Parameter Convergence Detector Based on VAMP Deep Unfolding: A Novel Radar Constant False Alarm Rate Detection Algorithm

    Authors: Haoyun Zhang, Jianghong Han, Xueqian Wang, Gang Li, Xiao-Ping Zhang

    Abstract: The sub-Nyquist radar framework exploits the sparsity of signals, which effectively alleviates the pressure on system storage and transmission bandwidth. Compressed sensing (CS) algorithms, such as the VAMP algorithm, are used for sparse signal processing in the sub-Nyquist radar framework. By combining deep unfolding techniques with VAMP, faster convergence and higher accuracy than traditional CS…

    Submitted 14 April, 2025; originally announced April 2025.

  10. arXiv:2504.01279  [pdf, other]

    stat.AP eess.IV

    SELIC: Semantic-Enhanced Learned Image Compression via High-Level Textual Guidance

    Authors: Haisheng Fu, Jie Liang, Zhenman Fang, Jingning Han

    Abstract: Learned image compression (LIC) techniques have achieved remarkable progress; however, effectively integrating high-level semantic information remains challenging. In this work, we present a Semantic-Enhanced Learned Image Compression framework, termed SELIC, which leverages high-level textual guidance to improve rate-distortion…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Accepted by ICME2025

  11. arXiv:2503.21254  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    Vision-to-Music Generation: A Survey

    Authors: Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao

    Abstract: Vision-to-music Generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary st…

    Submitted 27 March, 2025; originally announced March 2025.

  12. arXiv:2503.13468  [pdf, other]

    eess.SP cs.LG

    A CGAN-LSTM-Based Framework for Time-Varying Non-Stationary Channel Modeling

    Authors: Keying Guo, Ruisi He, Mi Yang, Yuxin Zhang, Bo Ai, Haoxiang Zhang, Jiahui Han, Ruifeng Chen

    Abstract: Time-varying non-stationary channels, with complex dynamic variations and temporal evolution characteristics, have significant challenges in channel modeling and communication system performance evaluation. Most existing methods of time-varying channel modeling focus on predicting channel state at a given moment or simulating short-term channel fluctuations, which are unable to capture the long-te…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: 11 pages, 7 figures

  13. arXiv:2503.02242  [pdf, other]

    cs.CV eess.IV

    $\mathbf{\Phi}$-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data

    Authors: Xidan Zhang, Yihan Zhuang, Qian Guo, Haodong Yang, Xuelin Qian, Gong Cheng, Junwei Han, Zhongling Huang

    Abstract: Approaches for improving generative adversarial networks (GANs) training under a few samples have been explored for natural images. However, these methods have limited effectiveness for synthetic aperture radar (SAR) images, as they do not account for the unique electromagnetic scattering properties of SAR. To remedy this, we propose a physics-inspired regularization method dubbed $Φ$-GAN, which i…

    Submitted 3 March, 2025; originally announced March 2025.

  14. arXiv:2502.19728  [pdf]

    eess.SY

    Transient Stability Analysis and Fault Clearing Angle Estimation of VSG Based on Domain of Attraction Estimated by Trajectory Reversing Method

    Authors: Jiayue Lyu, Tianzhi Fang, Zhiheng Lin, Jingxue Han, Yantao Zhu

    Abstract: The virtual synchronous generator (VSG), with the analogous nonlinear power-angle relationship to the synchronous generator (SG), has attracted much attention as a promising solution for converter-based power systems. In this paper, a large signal model of the grid-connected VSG is first established. The trajectory reversing method (TRM) is then introduced to estimate the domain of attraction (DOA…

    Submitted 26 February, 2025; originally announced February 2025.

    Comments: 9 pages, 11 figures, references added

  15. arXiv:2502.16914  [pdf, other]

    cs.SD cs.AI eess.AS

    ENACT-Heart -- ENsemble-based Assessment Using CNN and Transformer on Heart Sounds

    Authors: Jiho Han, Adnan Shaout

    Abstract: This study explores the application of Vision Transformer (ViT) principles in audio analysis, specifically focusing on heart sounds. This paper introduces ENACT-Heart - a novel ensemble approach that leverages the complementary strengths of Convolutional Neural Networks (CNN) and ViT through a Mixture of Experts (MoE) framework, achieving a remarkable classification accuracy of 97.52%. This outper…

    Submitted 24 February, 2025; originally announced February 2025.

    Comments: Accepted but not published in Global Digital Health Knowledge Exchange & Empowerment Conference (gDigiHealth.KEE)

  16. arXiv:2502.13983  [pdf, other]

    eess.AS cs.AI

    Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders

    Authors: Seungbae Kim, Daeun Lee, Brielle Stark, Jinyoung Han

    Abstract: Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, there has been little attention on integrating non-verbal communicatio…

    Submitted 18 February, 2025; originally announced February 2025.

  17. arXiv:2502.13429  [pdf, other]

    eess.SP

    Uplink Coordinated Pilot Design for 1-bit Massive MIMO in Correlated Channel

    Authors: Hyeongtak Yun, Juntaek Han, Kaiming Shen, Jeonghun Park

    Abstract: In this paper, we propose a coordinated pilot design method to minimize the channel estimation mean squared error (MSE) in 1-bit analog-to-digital converters (ADCs) massive multiple-input multiple-output (MIMO). Under the assumption that the well-known Bussgang linear minimum mean square error (BLMMSE) estimator is used for channel estimation, we first observe that the resulting MSE leads to an in…

    Submitted 19 February, 2025; originally announced February 2025.

    Comments: 5 pages, 2 figures

  18. arXiv:2502.03502  [pdf, other]

    eess.IV cs.AI cs.GR

    DC-VSR: Spatially and Temporally Consistent Video Super-Resolution with Video Diffusion Prior

    Authors: Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-seung Lee, Kyuha Choi, Youngseok Han, Sunghyun Cho

    Abstract: Video super-resolution (VSR) aims to reconstruct a high-resolution (HR) video from a low-resolution (LR) counterpart. Achieving successful VSR requires producing realistic HR details and ensuring both spatial and temporal consistency. To restore realistic details, diffusion-based VSR approaches have recently been proposed. However, the inherent randomness of diffusion, combined with their tile-bas…

    Submitted 26 May, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

    Comments: Equal contributions from first two authors

  19. arXiv:2501.18664  [pdf, other]

    eess.IV cs.AI cs.CV

    Rethinking the Upsampling Layer in Hyperspectral Image Super Resolution

    Authors: Haohan Shi, Fei Zhou, Xin Sun, Jungong Han

    Abstract: Deep learning has achieved significant success in single hyperspectral image super-resolution (SHSR); however, the high spectral dimensionality leads to a heavy computational burden, thus making it difficult to deploy in real-time scenarios. To address this issue, this paper proposes a novel lightweight SHSR network, i.e., LKCA-Net, that incorporates channel attention to calibrate multi-scale chan…

    Submitted 30 January, 2025; originally announced January 2025.

  20. arXiv:2501.17906  [pdf, other]

    cs.CV eess.IV

    Unsupervised Patch-GAN with Targeted Patch Ranking for Fine-Grained Novelty Detection in Medical Imaging

    Authors: Jingkun Chen, Guang Yang, Xiao Zhang, Jingchao Peng, Tianlu Zhang, Jianguo Zhang, Jungong Han, Vicente Grau

    Abstract: Detecting novel anomalies in medical imaging is challenging due to the limited availability of labeled data for rare abnormalities, which often display high variability and subtlety. This challenge is further compounded when small abnormal regions are embedded within larger normal areas, as whole-image predictions frequently overlook these subtle deviations. To address these issues, we propose an…

    Submitted 29 January, 2025; originally announced January 2025.

  21. arXiv:2501.15610  [pdf, other]

    eess.IV cs.CV

    Radiologist-in-the-Loop Self-Training for Generalizable CT Metal Artifact Reduction

    Authors: Chenglong Ma, Zilong Li, Yuanlin Li, Jing Han, Junping Zhang, Yi Zhang, Jiannan Liu, Hongming Shan

    Abstract: Metal artifacts in computed tomography (CT) images can significantly degrade image quality and impede accurate diagnosis. Supervised metal artifact reduction (MAR) methods, trained using simulated datasets, often struggle to perform well on real clinical CT images due to a substantial domain gap. Although state-of-the-art semi-supervised methods use pseudo ground-truths generated by a prior networ…

    Submitted 26 January, 2025; originally announced January 2025.

    Comments: IEEE TMI 2025

  22. arXiv:2501.08868  [pdf, other]

    eess.SY cs.HC

    Processing and Analyzing Real-World Driving Data: Insights on Trips, Scenarios, and Human Driving Behaviors

    Authors: Jihun Han, Dominik Karbowski, Ayman Moawad, Namdoo Kim, Aymeric Rousseau, Shihong Fan, Jason Hoon Lee, Jinho Ha

    Abstract: Analyzing large volumes of real-world driving data is essential for providing meaningful and reliable insights into real-world trips, scenarios, and human driving behaviors. To this end, we developed a multi-level data processing approach that adds new information, segments data, and extracts desired parameters. Leveraging a confidential but extensive dataset (over 1 million km), this approach lea…

    Submitted 15 January, 2025; originally announced January 2025.

  23. arXiv:2501.07094  [pdf, other]

    eess.SP

    Reducing Latency by Eliminating CSIT Feedback: FDD Downlink MIMO Precoding Without CSIT Feedback for Internet-of-Things Communications

    Authors: Juntaek Han, Namhyun Kim, Jeonghun Park

    Abstract: This paper presents a novel framework for low-latency frequency division duplex (FDD) multi-input multi-output (MIMO) transmission with Internet of Things (IoT) communications. Our key idea is eliminating feedback associated with downlink channel state information at the transmitter (CSIT) acquisition. Instead, we propose to reconstruct downlink CSIT from uplink reference signals by exploiting the…

    Submitted 13 January, 2025; originally announced January 2025.

    Comments: 13 pages

  24. arXiv:2501.00114  [pdf, other]

    eess.AS cs.SD

    DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

    Authors: Alexander Polok, Dominik Klement, Martin Kocour, Jiangyu Han, Federico Landini, Bolaji Yusuf, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

    Abstract: Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW e…

    Submitted 30 December, 2024; originally announced January 2025.

  25. arXiv:2412.18566  [pdf, other]

    cs.CL eess.AS

    Zero-resource Speech Translation and Recognition with LLMs

    Authors: Karel Mundnich, Xing Niu, Prashant Mathur, Srikanth Ronanki, Brady Houston, Veera Raghavendra Elluru, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Anshu Bhatia, Daniel Garcia-Romero, Kyu J. Han, Katrin Kirchhoff

    Abstract: Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a m…

    Submitted 30 December, 2024; v1 submitted 24 December, 2024; originally announced December 2024.

    Comments: ICASSP 2025, 5 pages, 2 figures, 2 tables

  26. arXiv:2412.17667  [pdf, other]

    cs.SD cs.MM eess.AS

    VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

    Authors: Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, Shinji Watanabe

    Abstract: In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompas…

    Submitted 26 March, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

  27. arXiv:2412.12590  [pdf, ps, other]

    cs.IT eess.SP

    Integrated Sensing and Communications in Downlink FDD MIMO without CSI Feedback

    Authors: Namhyun Kim, Juntaek Han, Jinseok Choi, Ahmed Alkhateeb, Chan-Byoung Chae, Jeonghun Park

    Abstract: In this paper, we propose a precoding framework for frequency division duplex (FDD) integrated sensing and communication (ISAC) systems with multiple-input multiple-output (MIMO). Specifically, we aim to maximize ergodic sum spectral efficiency (SE) while satisfying a sensing beam pattern constraint defined by the mean squared error (MSE). Our method reconstructs downlink (DL) channel state inform…

    Submitted 10 June, 2025; v1 submitted 17 December, 2024; originally announced December 2024.

    Comments: submitted to possible IEEE publication

  28. arXiv:2412.09428  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

    Authors: Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Wu Chengjing, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu

    Abstract: Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their application in multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses the…

    Submitted 12 December, 2024; originally announced December 2024.

  29. arXiv:2412.02611  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM cs.SD eess.AS

    AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

    Authors: Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue

    Abstract: Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two s…

    Submitted 3 December, 2024; originally announced December 2024.

    Comments: Project page: https://av-odyssey.github.io/

  30. arXiv:2411.14567  [pdf]

    eess.SY

    Energy Efficient Automated Driving as a GNEP: Vehicle-in-the-loop Experiments

    Authors: Viranjan Bhattacharyya, Tyler Ard, Rongyao Wang, Ardalan Vahidi, Yunyi Jia, Jihun Han

    Abstract: In this paper, a multi-agent motion planning problem is studied aiming to minimize energy consumption of connected automated vehicles (CAVs) in lane change scenarios. We model this interactive motion planning as a generalized Nash equilibrium problem and formalize how vehicle-to-vehicle intention sharing enables solution of the game between multiple CAVs as an optimal control problem for each agen…

    Submitted 21 November, 2024; originally announced November 2024.

  31. arXiv:2411.09339  [pdf, other]

    cs.SD cs.CL eess.AS

    Re-Parameterization of Lightweight Transformer for On-Device Speech Emotion Recognition

    Authors: Zixing Zhang, Zhongren Dong, Weixiang Xu, Jing Han

    Abstract: With the increasing implementation of machine learning models on edge or Internet-of-Things (IoT) devices, deploying advanced models on resource-constrained IoT devices remains challenging. Transformer models, a currently dominant neural architecture, have achieved great success in broad domains but their complexity hinders its deployment on IoT devices with limited computation capability and stor…

    Submitted 14 November, 2024; originally announced November 2024.

  32. arXiv:2411.05361  [pdf, ps, other]

    cs.CL eess.AS

    Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

    Authors: Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang, et al. (55 additional authors not shown)

    Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati…

    Submitted 9 June, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: ICLR 2025

  33. arXiv:2411.05027  [pdf, other]

    cs.CV cs.AI eess.IV

    Generative Artificial Intelligence Meets Synthetic Aperture Radar: A Survey

    Authors: Zhongling Huang, Xidan Zhang, Zuqian Tang, Feng Xu, Mihai Datcu, Junwei Han

    Abstract: SAR images possess unique attributes that present challenges for both human observers and vision AI models to interpret, owing to their electromagnetic characteristics. The interpretation of SAR images encounters various hurdles, with one of the primary obstacles being the data itself, which includes issues related to both the quantity and quality of the data. The challenges can be addressed using…

    Submitted 4 November, 2024; originally announced November 2024.

  34. arXiv:2410.23812  [pdf, other]

    cs.LG eess.SP

    Graph Neural Networks Uncover Geometric Neural Representations in Reinforcement-Based Motor Learning

    Authors: Federico Nardi, Jinpei Han, Shlomi Haar, A. Aldo Faisal

    Abstract: Graph Neural Networks (GNN) can capture the geometric properties of neural representations in EEG data. Here we utilise those to study how reinforcement-based motor learning affects neural activity patterns during motor planning, leveraging the inherent graph structure of EEG channels to capture the spatial relationships in brain activity. By exploiting task-specific symmetries, we define differen…

    Submitted 31 October, 2024; originally announced October 2024.

    Comments: 19 pages, 7 figures, accepted at the NeurIPS 2024 workshop on Symmetry and Geometry in Neural Representations (NeurReps 2024)

  35. arXiv:2410.12990  [pdf, other]

    eess.SY cs.AR

    Low-Power Encoding for PAM-3 DRAM Bus

    Authors: Jonghyeon Nam, Jaeduk Han, Hokeun Kim

    Abstract: The 3-level pulse amplitude modulation (PAM-3) signaling is expected to be widely used in memory interfaces for its greater voltage margins compared to PAM-4. To maximize the benefit of PAM-3, we propose three low-power data encoding algorithms: PAM3-DBI, PAM3-MF, and PAM3-SORT. With the DRAM memory traces from the gem5 computer architecture simulator running benchmarks, we evaluate the energy eff…

    Submitted 16 October, 2024; originally announced October 2024.

    Comments: To appear in Proceedings of the 20th International Conference on Synthesis, Modeling, Analysis and Simulation Methods, and Applications to Circuit Design (SMACD 2024)

  36. arXiv:2410.10335  [pdf, ps, other]

    cs.IT eess.SP

    Performance of a Threshold-based WDM and ACM for FSO Communication between Mobile Platforms in Maritime Environments

    Authors: Jae-Eun Han, Sung Sik Nam, Duck Dong Hwang, Mohamed-Slim Alouini

    Abstract: In this study, we statistically analyze the performance of a threshold-based multiple optical signal selection scheme (TMOS) for wavelength division multiplexing (WDM) and adaptive coded modulation (ACM) using free space optical (FSO) communication between mobile platforms in maritime environments with fog and 3D pointing errors. Specifically, we derive a new closed-form expression for a composite…

    Submitted 14 October, 2024; originally announced October 2024.

  37. arXiv:2410.00890  [pdf, ps, other]

    cs.CV cs.GR eess.IV

    Flex3D: Feed-Forward 3D Generation with Flexible Reconstruction Model and Input View Curation

    Authors: Junlin Han, Jianyuan Wang, Andrea Vedaldi, Philip Torr, Filippos Kokkinos

    Abstract: Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications. Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their abili…

    Submitted 1 June, 2025; v1 submitted 1 October, 2024; originally announced October 2024.

    Comments: ICML 25. Project page: https://junlinhan.github.io/projects/flex3d/

  38. arXiv:2409.16937  [pdf, other]

    eess.AS cs.AI cs.CL cs.MM cs.SD

    Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling

    Authors: Yuanchao Li, Zixing Zhang, Jing Han, Peter Bell, Catherine Lai

    Abstract: The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data fo…

    Submitted 30 April, 2025; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: Accepted to ICASSP 2025

  39. arXiv:2409.09601  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    A Survey of Foundation Models for Music Understanding

    Authors: Wenjun Li, Ying Cai, Ziyang Wu, Wenyi Zhang, Yifan Chen, Rundong Qi, Mengqi Dong, Peigen Chen, Xiao Dong, Fenghao Shi, Lei Guo, Junwei Han, Bao Ge, Tianming Liu, Lin Gan, Tuo Zhang

    Abstract: Music is essential in daily life, fulfilling emotional and entertainment needs, and connecting us personally, socially, and culturally. A better understanding of music can enhance our emotions, cognitive skills, and cultural connections. The rapid advancement of artificial intelligence (AI) has introduced new ways to analyze music, aiming to replicate human understanding of music and provide relat…

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: 20 pages, 2 figures

  40. arXiv:2409.09506  [pdf, other]

    cs.SD cs.AI eess.AS

    ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

    Authors: Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe

    Abstract: We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, a…

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: Accepted to SLT 2024

  41. arXiv:2409.09408  [pdf, other]

    eess.AS cs.SD

    Leveraging Self-Supervised Learning for Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, Lukas Burget

    Abstract: End-to-end neural diarization has evolved considerably over the past few years, but data scarcity is still a major obstacle for further improvements. Self-supervised learning methods such as WavLM have shown promising performance on several downstream tasks, but their application on speaker diarization is somehow limited. In this work, we explore using WavLM to alleviate the problem of data scarci…

    Submitted 21 October, 2024; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025; New results are updated but conclusions are exactly the same as the original one

  42. arXiv:2409.08271  [pdf, other]

    cs.CV cs.GR cs.LG eess.IV

    DreamBeast: Distilling 3D Fantastical Animals with Part-Aware Knowledge Transfer

    Authors: Runjia Li, Junlin Han, Luke Melas-Kyriazi, Chunyi Sun, Zhaochong An, Zhongrui Gui, Shuyang Sun, Philip Torr, Tomas Jakab

    Abstract: We present DreamBeast, a novel method based on score distillation sampling (SDS) for generating fantastical 3D animal assets composed of distinct parts. Existing SDS methods often struggle with this generation task due to a limited understanding of part-level semantics in text-to-image diffusion models. While recent diffusion models, such as Stable Diffusion 3, demonstrate a better part-level unde…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: Project page: https://dreambeast3d.github.io/, code: https://github.com/runjiali-rl/threestudio-dreambeast

  43. arXiv:2409.07226  [pdf, other]

    cs.SD eess.AS

    Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

    Authors: Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, Qin Jin

    Abstract: This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs and offer significant advantages in versatility and intelligence, supporting multi-format in…

    Submitted 10 October, 2024; v1 submitted 11 September, 2024; originally announced September 2024.

    Comments: Accepted by ACMMM 2024 demo track

  44. arXiv:2409.07040  [pdf, other]

    cs.CV eess.IV

    Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement

    Authors: Xianmin Chen, Peiliang Huang, Xiaoxu Feng, Dingwen Zhang, Longfei Han, Junwei Han

    Abstract: Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, leading to limited denoisin…

    Submitted 31 December, 2024; v1 submitted 11 September, 2024; originally announced September 2024.

  45. arXiv:2408.04214  [pdf, ps, other]

    eess.SP

    Convolution Type of Metaplectic Cohen's Distribution Time-Frequency Analysis Theory, Method and Technology

    Authors: Manjun Cui, Zhichao Zhang, Jie Han, Yunjie Chen, Chunzheng Cao

    Abstract: The conventional Cohen's distribution can't meet the requirement of additive noises jamming signals high-performance denoising under the condition of low signal-to-noise ratio, it is necessary to integrate the metaplectic transform for non-stationary signal fractional domain time-frequency analysis. In this paper, we blend time-frequency operators and coordinate operator fractionizations to formul…

    Submitted 8 August, 2024; originally announced August 2024.

  46. arXiv:2408.04210  [pdf, ps, other]

    eess.SP

    Adaptive Cohen's Class Time-Frequency Distribution

    Authors: Manjun Cui, Zhichao Zhang, Jie Han, Yunjie Chen, Chunzheng Cao

    Abstract: The fixed kernel function-based Cohen's class time-frequency distributions (CCTFDs) allow flexibility in denoising for some specific polluted signals. Due to the limitation of fixed kernel functions, however, from the view point of filtering they fail to automatically adjust the response according to the change of signal to adapt to different signal characteristics. In this letter, we integrate Wi…

    Submitted 8 August, 2024; originally announced August 2024.

  47. arXiv:2407.19436  [pdf, other]

    cs.CV eess.IV

    X-Fake: Juggling Utility Evaluation and Explanation of Simulated SAR Images

    Authors: Zhongling Huang, Yihan Zhuang, Zipei Zhong, Feng Xu, Gong Cheng, Junwei Han

    Abstract: SAR image simulation has attracted much attention due to its great potential to supplement the scarce training data for deep learning algorithms. Consequently, evaluating the quality of the simulated SAR image is crucial for practical applications. The current literature primarily uses image quality assessment techniques for evaluation that rely on human observers' perceptions. However, because of…

    Submitted 28 July, 2024; originally announced July 2024.

  48. arXiv:2407.18931  [pdf, other]

    cs.IT eess.SP

    Multi-dimensional Graph Linear Canonical Transform

    Authors: Na Li, Zhichao Zhang, Jie Han, Yunjie Chen, Chunzheng Cao

    Abstract: Many multi-dimensional (M-D) graph signals appear in the real world, such as digital images, sensor network measurements and temperature records from weather observation stations. It is a key challenge to design a transform method for processing these graph M-D signals in the linear canonical transform domain. This paper proposes the two-dimensional graph linear canonical transform based on the ce…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2407.17513

  49. arXiv:2407.17513  [pdf, other]

    cs.IT eess.SP

    Graph Linear Canonical Transform Based on CM-CC-CM Decomposition

    Authors: Na Li, Zhichao Zhang, Jie Han, Yunjie Chen, Chunzheng Cao

    Abstract: The graph linear canonical transform (GLCT) is presented as an extension of the graph Fourier transform (GFT) and the graph fractional Fourier transform (GFrFT), offering more flexibility as an effective tool for graph signal processing. In this paper, we introduce a GLCT based on chirp multiplication-chirp convolution-chirp multiplication decomposition (CM-CC-CM-GLCT), which irrelevant to samplin…

    Submitted 8 July, 2024; originally announced July 2024.

  50. arXiv:2407.16634  [pdf, other]

    eess.IV cs.AI cs.CV cs.HC

    Knowledge-driven AI-generated data for accurate and interpretable breast ultrasound diagnoses

    Authors: Haojun Yu, Youcheng Li, Nan Zhang, Zihan Niu, Xuantong Gong, Yanwen Luo, Quanlin Wu, Wangyan Qin, Mengyuan Zhou, Jie Han, Jia Tao, Ziwei Zhao, Di Dai, Di He, Dong Wang, Binghui Tang, Ling Huo, Qingli Zhu, Yong Wang, Liwei Wang

    Abstract: Data-driven deep learning models have shown great capabilities to assist radiologists in breast ultrasound (US) diagnoses. However, their effectiveness is limited by the long-tail distribution of training data, which leads to inaccuracies in rare cases. In this study, we address a long-standing challenge of improving the diagnostic model performance on rare cases using long-tailed data. Specifical…

    Submitted 23 July, 2024; originally announced July 2024.