Showing 1–50 of 338 results for author: Zhou, J

Searching in archive eess.

  1. arXiv:2507.03421  [pdf, ps, other]

    eess.IV cs.CV

    Hybrid-View Attention Network for Clinically Significant Prostate Cancer Classification in Transrectal Ultrasound

    Authors: Zetian Feng, Juan Fu, Xuebin Zou, Hongsheng Ye, Hong Wu, Jianhua Zhou, Yi Wang

    Abstract: Prostate cancer (PCa) is a leading cause of cancer-related mortality in men, and accurate identification of clinically significant PCa (csPCa) is critical for timely intervention. Transrectal ultrasound (TRUS) is widely used for prostate biopsy; however, its low contrast and anisotropic spatial resolution pose diagnostic challenges. To address these limitations, we propose a novel hybrid-view atte…

    Submitted 9 July, 2025; v1 submitted 4 July, 2025; originally announced July 2025.

  2. arXiv:2507.00185  [pdf]

    eess.IV cs.AI cs.CV

    Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

    Authors: Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting

    Abstract: Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundatio…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: 42 pages, 3 composite figures, 4 tables

  3. arXiv:2506.22810  [pdf, ps, other]

    cs.SD eess.AS

    A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition

    Authors: Shiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin

    Abstract: Dysarthric speech recognition (DSR) enhances the accessibility of smart devices for dysarthric speakers with limited mobility. Previously, DSR research was constrained by the fact that existing datasets typically consisted of isolated words, command phrases, and a limited number of sentences spoken by a few individuals. This constrained research to command-interaction systems and speaker adaptatio…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: accepted by Interspeech 2025

    Journal ref: INTERSPEECH 2025

  4. arXiv:2506.19299  [pdf, ps, other]

    eess.SY

    Online Algorithms for Recovery of Low-Rank Parameter Matrix in Non-stationary Stochastic Systems

    Authors: Yanxin Fu, Junbao Zhou, Yu Hu, Wenxiao Zhao

    Abstract: This paper presents a two-stage online algorithm for recovery of low-rank parameter matrix in non-stationary stochastic systems. The first stage applies the recursive least squares (RLS) estimator combined with its singular value decomposition to estimate the unknown parameter matrix within the system, leveraging RLS for adaptability and SVD to reveal low-rank structure. The second stage introduce…

    Submitted 24 June, 2025; originally announced June 2025.

  5. arXiv:2506.18402  [pdf]

    eess.AS

    Infant Cry Emotion Recognition Using Improved ECAPA-TDNN with Multiscale Feature Fusion and Attention Enhancement

    Authors: Junyu Zhou, Yanxiong Li, Haolin Yu

    Abstract: Infant cry emotion recognition is crucial for parenting and medical applications. It faces many challenges, such as subtle emotional variations, noise interference, and limited data. The existing methods lack the ability to effectively integrate multi-scale features and temporal-frequency relationships. In this study, we propose a method for infant cry emotion recognition using an improved Emphasi…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Accepted for publication on Interspeech 2025. 5 pages, 2 tables and 7 figures

  6. arXiv:2506.13419  [pdf, ps, other]

    eess.IV cs.CV

    Audio-Visual Driven Compression for Low-Bitrate Talking Head Videos

    Authors: Riku Takahashi, Ryugo Morita, Jinjia Zhou

    Abstract: Talking head video compression has advanced with neural rendering and keypoint-based methods, but challenges remain, especially at low bit rates, including handling large head movements, suboptimal lip synchronization, and distorted facial reconstructions. To address these problems, we propose a novel audio-visual driven video codec that integrates compact 3D motion features and audio signals. Thi…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted to ICMR2025

  7. arXiv:2506.12570  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling

    Authors: Hui Wang, Yifan Yang, Shujie Liu, Jinyu Li, Lingwei Meng, Yanqing Liu, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

    Abstract: Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In thi…

    Submitted 14 June, 2025; originally announced June 2025.

  8. arXiv:2506.09344  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.SD eess.AS

    Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, et al. (33 additional authors not shown)

    Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  9. arXiv:2506.06824  [pdf, ps, other]

    eess.SY

    Deep reinforcement learning-based joint real-time energy scheduling for green buildings with heterogeneous battery energy storage devices

    Authors: Chi Liu, Zhezhuang Xu, Jiawei Zhou, Yazhou Yuan, Kai Ma, Meng Yuan

    Abstract: Green buildings (GBs) with renewable energy and building energy management systems (BEMS) enable efficient energy use and support sustainable development. Electric vehicles (EVs), as flexible storage resources, enhance system flexibility when integrated with stationary energy storage systems (ESS) for real-time scheduling. However, differing degradation and operational characteristics of ESS and E…

    Submitted 21 June, 2025; v1 submitted 7 June, 2025; originally announced June 2025.

  10. arXiv:2506.05171  [pdf, other]

    eess.SY cs.AI

    Towards provable probabilistic safety for scalable embodied AI systems

    Authors: Linxuan He, Qing-Shan Jia, Ang Li, Hongyan Sang, Ling Wang, Jiwen Lu, Tao Zhang, Jie Zhou, Yi Zhang, Yisen Wang, Peng Wei, Zhongyuan Wang, Henry X. Liu, Shuo Feng

    Abstract: Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving prov…

    Submitted 5 June, 2025; originally announced June 2025.

  11. arXiv:2506.00466  [pdf, ps, other]

    eess.AS cs.SD

    M3ANet: Multi-scale and Multi-Modal Alignment Network for Brain-Assisted Target Speaker Extraction

    Authors: Cunhang Fan, Ying Chen, Jian Zhou, Zexu Pan, Jingjing Zhang, Youdian Gao, Xiaoke Yang, Zhengqi Wen, Zhao Lv

    Abstract: The brain-assisted target speaker extraction (TSE) aims to extract the attended speech from mixed speech by utilizing the brain neural activities, for example Electroencephalography (EEG). However, existing models overlook the issue of temporal misalignment between speech and EEG modalities, which hampers TSE performance. In addition, the speech encoder in current models typically uses basic tempo…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted to IJCAI 2025

  12. arXiv:2505.19437  [pdf, ps, other]

    cs.SD eess.AS

    RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval

    Authors: Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin

    Abstract: The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance in general audio description-related tasks, such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retri…

    Submitted 25 May, 2025; originally announced May 2025.

  13. arXiv:2505.15364  [pdf, ps, other]

    cs.HC cs.SD eess.AS

    MHANet: Multi-scale Hybrid Attention Network for Auditory Attention Detection

    Authors: Lu Li, Cunhang Fan, Hongyu Zhang, Jingjing Zhang, Xiaoke Yang, Jian Zhou, Zhao Lv

    Abstract: Auditory attention detection (AAD) aims to detect the target speaker in a multi-talker environment from brain signals, such as electroencephalography (EEG), which has made great progress. However, most AAD methods solely utilize attention mechanisms sequentially and overlook valuable multi-scale contextual information within EEG signals, limiting their ability to capture long-short range spatiotem…

    Submitted 21 May, 2025; originally announced May 2025.

  14. arXiv:2505.13181  [pdf, other]

    cs.CL cs.SD eess.AS

    Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

    Authors: Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang

    Abstract: We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying con…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Demos and code are available at https://github.com/ictnlp/SLED-TTS

  15. arXiv:2505.12258  [pdf, ps, other]

    cs.IT eess.SP

    An Information-Theoretic Framework for Receiver Quantization in Communication

    Authors: Jing Zhou, Shuqin Pang, Wenyi Zhang

    Abstract: We investigate information-theoretic limits and design of communication under receiver quantization. Unlike most existing studies, this work is more focused on the impact of resolution reduction from high to low. We consider a standard transceiver architecture, which includes i.i.d. complex Gaussian codebook at the transmitter, and a symmetric quantizer cascaded with a nearest neighbor decoder at…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: 35 pages, 17 figures. The material in this paper will be presented in part at the IEEE International Symposium on Information Theory (ISIT), Ann Arbor, MI, USA, June 2025 (see arXiv:2501.09961)

  16. arXiv:2505.10348  [pdf, ps, other]

    cs.HC cs.SD eess.AS

    ListenNet: A Lightweight Spatio-Temporal Enhancement Nested Network for Auditory Attention Detection

    Authors: Cunhang Fan, Xiaoke Yang, Hongyu Zhang, Ying Chen, Lu Li, Jian Zhou, Zhao Lv

    Abstract: Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spa…

    Submitted 15 May, 2025; originally announced May 2025.

  17. arXiv:2505.10174  [pdf, ps, other]

    eess.SP

    Subspace-Based Super-Resolution Sensing for Bi-Static ISAC with Clock Asynchronism

    Authors: Jingbo Zhao, Zhaoming Lu, J. Andrew Zhang, Jiaxi Zhou, Weicai Li, Tao Gu

    Abstract: Bi-static sensing is an attractive configuration for integrated sensing and communications (ISAC) systems; however, clock asynchronism between widely separated transmitters and receivers introduces time-varying time offsets (TO) and phase offsets (PO), posing significant challenges. This paper introduces a signal-subspace-based framework that estimates decoupled angles, delays, and complex gain se…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: 13 pages, 9 figures. This work has been submitted to the IEEE for possible publication

  18. arXiv:2505.06122  [pdf, other]

    eess.SY

    Interaction-Aware Parameter Privacy-Preserving Data Sharing in Coupled Systems via Particle Filter Reinforcement Learning

    Authors: Haokun Yu, Jingyuan Zhou, Kaidi Yang

    Abstract: This paper addresses the problem of parameter privacy-preserving data sharing in coupled systems, where a data provider shares data with a data user but wants to protect its sensitive parameters. The shared data affects not only the data user's decision-making but also the data provider's operations through system interactions. To trade off control performance and privacy, we propose an interactio…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: 21 pages, 8 figures, accepted at the 7th Annual Learning for Dynamics and Control (L4DC) Conference, 2025

  19. arXiv:2505.03482  [pdf, other]

    eess.SY math.OC

    Learning-based Homothetic Tube MPC

    Authors: Yulong Gao, Shuhao Yan, Jian Zhou, Mark Cannon

    Abstract: In this paper, we study homothetic tube model predictive control (MPC) of discrete-time linear systems subject to bounded additive disturbance and mixed constraints on the state and input. Different from most existing work on robust MPC, we assume that the true disturbance set is unknown but a conservative surrogate is available a priori. Leveraging the real-time data, we develop an online learnin…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Accepted for presentation at the 23rd European Control Conference

  20. arXiv:2504.21214  [pdf, other]

    cs.CL cs.AI eess.AS

    Pretraining Large Brain Language Model for Active BCI: Silent Speech

    Authors: Jinzhao Zhou, Zehong Cao, Yiqun Duan, Connor Barkley, Daniel Leong, Xiaowei Jiang, Quoc-Toan Nguyen, Ziyi Zhao, Thomas Do, Yu-Cheng Chang, Sheng-Fu Liang, Chin-teng Lin

    Abstract: This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the re…

    Submitted 3 May, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

  21. arXiv:2504.19262  [pdf, other]

    eess.SP

    Super-resolution Wideband Beam Training for Near-field Communications with Ultra-low Overhead

    Authors: Cong Zhou, Changsheng You, Shuo Shi, Jiasi Zhou, Chenyu Wu

    Abstract: In this paper, we propose a super-resolution wideband beam training method for near-field communications, which is able to achieve ultra-low overhead. To this end, we first study the multi-beam characteristic of a sparse uniform linear array (S-ULA) in the wideband. Interestingly, we show that this leads to a new beam pattern property, called rainbow blocks, where the S-ULA generates multiple grat…

    Submitted 2 May, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

  22. arXiv:2504.13912  [pdf, other]

    eess.SY

    Koopman Spectral Analysis and System Identification for Stochastic Dynamical Systems via Yosida Approximation of Generators

    Authors: Jun Zhou, Yiming Meng, Jun Liu

    Abstract: System identification and Koopman spectral analysis are crucial for uncovering physical laws and understanding the long-term behaviour of stochastic dynamical systems governed by stochastic differential equations (SDEs). In this work, we propose a novel method for estimating the Koopman generator of systems of SDEs, based on the theory of resolvent operators and the Yosida approximation. This enab…

    Submitted 10 April, 2025; originally announced April 2025.

  23. arXiv:2504.11936  [pdf, other]

    cs.GR cs.HC eess.SP

    Mind2Matter: Creating 3D Models from EEG Signals

    Authors: Xia Deng, Shen Chen, Jiale Zhou, Lei Li

    Abstract: The reconstruction of 3D objects from brain signals has gained significant attention in brain-computer interface (BCI) research. Current research predominantly utilizes functional magnetic resonance imaging (fMRI) for 3D reconstruction tasks due to its excellent spatial resolution. Nevertheless, the clinical utility of fMRI is limited by its prohibitive costs and inability to support real-time ope…

    Submitted 5 May, 2025; v1 submitted 16 April, 2025; originally announced April 2025.

  24. arXiv:2504.07656  [pdf, other]

    eess.SP cs.IT

    Integrated Sensing, Computing, and Semantic Communication with Fluid Antenna for Metaverse

    Authors: Yinchao Yang, Jingxuan Zhou, Zhaohui Yang

    Abstract: The integration of sensing and communication (ISAC) is pivotal for the Metaverse but faces challenges like high data volume and privacy concerns. This paper proposes a novel integrated sensing, computing, and semantic communication (ISCSC) framework, which uses semantic communication to transmit only contextual information, reducing data overhead and enhancing efficiency. To address the sensitivit…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Accepted by Infocom workshop 2025

  25. arXiv:2504.04928  [pdf, other]

    eess.SP

    Advanced Codebook Design for SCMA-aided NTNs With Randomly Distributed Users

    Authors: Tianyang Hu, Qu Luo, Lixia Xiao, Jiaxi Zhou, Pei Xiao, Tao Jiang

    Abstract: In this letter, a novel class of sparse codebooks is proposed for sparse code multiple access (SCMA) aided non-terrestrial networks (NTN) with randomly distributed users characterized by Rician fading channels. Specifically, we first exploit the upper bound of bit error probability (BEP) of an SCMA-aided NTN with large-scale fading of different users under Rician fading channels. Then, the codeboo…

    Submitted 7 April, 2025; originally announced April 2025.

  26. arXiv:2504.04475  [pdf, ps, other]

    eess.SY

    Distributed Nash Equilibrium Seeking in Coalition Games for Uncertain Euler-Lagrange Systems With Application to USV Swarm Confrontation

    Authors: Cheng Yuwen, Guanghui Wen, Jialing Zhou, Meng Luan, Tingwen Huang

    Abstract: In this paper, a coalition game with local and coupling constraints is studied for uncertain Euler-Lagrange (EL) systems subject to disturbances with unknown bounds. In the coalition game, each agent collaborates with other agents within the same coalition to optimize its coalition's cost function while simultaneously competing against agents in other coalitions. Under a distributed framework wher…

    Submitted 6 April, 2025; originally announced April 2025.

  27. arXiv:2504.02880  [pdf]

    eess.IV cs.AI cs.CV

    Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms

    Authors: Junchi Zhou, Haozhou Wang, Yoichiro Kato, Tejasri Nampally, P. Rajalakshmi, M. Balram, Keisuke Katsura, Hao Lu, Yue Mu, Wanneng Yang, Yangmingrui Gao, Feng Xiao, Hongtao Chen, Yuhao Chen, Wenjuan Li, Jingwen Wang, Fenghua Yu, Jian Zhou, Wensheng Wang, Xiaochun Hu, Yuanzhu Yang, Yanfeng Ding, Wei Guo, Shouyang Liu

    Abstract: Developing computer vision-based rice phenotyping techniques is crucial for precision field management and accelerating breeding, thereby continuously advancing rice production. Among phenotyping tasks, distinguishing image components is a key prerequisite for characterizing plant growth and development at the organ scale, enabling deeper insights into eco-physiological processes. However, due to…

    Submitted 2 April, 2025; originally announced April 2025.

  28. arXiv:2504.01081  [pdf, other]

    cs.CV cs.CL eess.IV

    ShieldGemma 2: Robust and Tractable Image Content Moderation

    Authors: Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, Jindong Gu, Yiwen Song, Cai Xu, Jingjing Zhou, Aparna Joshi, Shravan Dheep, Mani Malek, Hamid Palangi, Joon Baek, Rick Pereira, Karthik Narasimhan

    Abstract: We introduce ShieldGemma 2, a 4B parameter image content moderation model built on Gemma 3. This model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content for synthetic images (e.g. output of any image generation model) and natural images (e.g. any image input to a Vision-Language Model). We evaluated on both…

    Submitted 8 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

  29. arXiv:2503.20509  [pdf, other]

    eess.SY

    Problem-Structure-Informed Quantum Approximate Optimization Algorithm for Large-Scale Unit Commitment with Limited Qubits

    Authors: Jingxian Zhou, Ziqing Zhu, Linghua Zhu, Siqi Bu

    Abstract: As power systems expand, solving the Unit Commitment Problem (UCP) becomes increasingly challenging due to the dimensional catastrophe, and traditional methods often struggle to balance computational efficiency and solution quality. To tackle this issue, we propose a problem-structure-informed Quantum Approximate Optimization Algorithm (QAOA) framework that fully exploits the quantum advantage und…

    Submitted 26 March, 2025; originally announced March 2025.

  30. arXiv:2503.17915  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Cat-AIR: Content and Task-Aware All-in-One Image Restoration

    Authors: Jiachen Jiang, Tianyu Ding, Ke Zhang, Jinxin Zhou, Tianyi Chen, Ilya Zharkov, Zhihui Zhu, Luming Liang

    Abstract: All-in-one image restoration seeks to recover high-quality images from various types of degradation using a single model, without prior knowledge of the corruption source. However, existing methods often struggle to effectively and efficiently handle multiple degradation types. We present Cat-AIR, a novel Content And Task-aware framework for All-in-one I…

    Submitted 22 March, 2025; originally announced March 2025.

  31. arXiv:2503.16578  [pdf, other]

    cs.CL cs.SD eess.AS

    SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors

    Authors: Yang Chen, Hui Wang, Shiyao Wang, Junyang Chen, Jiabei He, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

    Abstract: While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions…

    Submitted 20 March, 2025; originally announced March 2025.

  32. arXiv:2503.13660  [pdf, other]

    cs.RO cs.AI cs.FL eess.SY

    INPROVF: Leveraging Large Language Models to Repair High-level Robot Controllers from Assumption Violations

    Authors: Qian Meng, Jin Peng Zhou, Kilian Q. Weinberger, Hadas Kress-Gazit

    Abstract: This paper presents INPROVF, an automatic framework that combines large language models (LLMs) and formal methods to speed up the repair process of high-level robot controllers. Previous approaches based solely on formal methods are computationally expensive and cannot scale to large state spaces. In contrast, INPROVF uses LLMs to generate repair candidates, and formal methods to verify their corr…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: To appear in ICLR 2025 Workshop: VerifAI: AI Verification in the Wild; in submission to 2025 IEEE 21st International Conference on Automation Science and Engineering (CASE), Los Angeles, CA, USA: IEEE, Aug. 2025

  33. arXiv:2503.11855  [pdf, other]

    cs.RO eess.SY

    Learning-based Estimation of Forward Kinematics for an Orthotic Parallel Robotic Mechanism

    Authors: Jingzong Zhou, Yuhan Zhu, Xiaobin Zhang, Sunil Agrawal, Konstantinos Karydis

    Abstract: This paper introduces a 3D parallel robot with three identical five-degree-of-freedom chains connected to a circular brace end-effector, aimed to serve as an assistive device for patients with cervical spondylosis. The inverse kinematics of the system is solved analytically, whereas learning-based methods are deployed to solve the forward kinematics. The methods considered herein include a Koopman…

    Submitted 14 March, 2025; originally announced March 2025.

  34. arXiv:2503.11618  [pdf]

    physics.optics eess.SP

    Pushing DSP-Free Coherent Interconnect to the Last Inch by Optically Analog Signal Processing

    Authors: Mingming Zhang, Haoze Du, Xuefeng Wang, Junda Chen, Weihao Li, Zihe Hu, Yizhao Chen, Can Zhao, Hao Wu, Jiajun Zhou, Siyang Liu, Siqi Yan, Ming Tang

    Abstract: To support the boosting interconnect capacity of the AI-related data centers, novel techniques enabled high-speed and low-cost optics are continuously emerging. When the baud rate approaches 200 GBaud per lane, the bottle-neck of traditional intensity modulation direct detection (IM-DD) architectures becomes increasingly evident. The simplified coherent solutions are widely discussed and considere…

    Submitted 14 March, 2025; originally announced March 2025.

  35. arXiv:2503.09787  [pdf, other]

    eess.IV cs.CV

    Bidirectional Learned Facial Animation Codec for Low Bitrate Talking Head Videos

    Authors: Riku Takahashi, Ryugo Morita, Fuma Kimishima, Kosuke Iwama, Jinjia Zhou

    Abstract: Existing deep facial animation coding techniques efficiently compress talking head videos by applying deep generative models. Instead of compressing the entire video sequence, these methods focus on compressing only the keyframe and the keypoints of non-keyframes (target frames). The target frames are then reconstructed by utilizing a single keyframe, and the keypoints of the target frame. Althoug…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: Accepted to DCC2025

  36. arXiv:2503.09491  [pdf, other]

    cs.CV eess.IV

    DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction

    Authors: Junjie Zhou, Shouju Wang, Yuxia Tang, Qi Zhu, Daoqiang Zhang, Wei Shao

    Abstract: The prediction of nanoparticles (NPs) distribution is crucial for the diagnosis and treatment of tumors. Recent studies indicate that the heterogeneity of tumor microenvironment (TME) highly affects the distribution of NPs across tumors. Hence, it has become a research hotspot to generate the NPs distribution by the aid of multi-modal TME components. However, the distribution divergence among mult…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  37. arXiv:2502.18913  [pdf, other]

    cs.CL cs.SD eess.AS

    CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition

    Authors: Jiaming Zhou, Yujie Guo, Shiwan Zhao, Haoqin Sun, Hui Wang, Jiabei He, Aobo Kong, Shiyao Wang, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

    Abstract: Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for r…

    Submitted 11 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  38. arXiv:2502.11128  [pdf, other]

    cs.CL cs.SD eess.AS

    FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

    Authors: Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin

    Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FE…

    Submitted 16 February, 2025; originally announced February 2025.

  39. arXiv:2502.06289  [pdf]

    eess.IV cs.AI cs.CV

    Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?

    Authors: Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham

    Abstract: The advent of foundation models (FMs) is transforming medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domai…

    Submitted 10 February, 2025; originally announced February 2025.

  40. arXiv:2502.05845  [pdf]

    eess.SY

    Exploiting the Hidden Capacity of MMC Through Accurate Quantification of Modulation Indices

    Authors: Qianhao Sun, Jingwei Meng, Ruofan Li, Mingchao Xia, Qifang Chen, Jiejie Zhou, Meiqi Fan, Peiqian Guo

    Abstract: The modular multilevel converter (MMC) has become increasingly important in voltage-source converter-based high-voltage direct current (VSC-HVDC) systems. Direct and indirect modulation are widely used as mainstream modulation techniques in MMCs. However, due to the challenge of quantitatively evaluating the operation of different modulation schemes, the academic and industrial communities still h…

    Submitted 9 February, 2025; originally announced February 2025.

  41. arXiv:2501.11028  [pdf, other]

    eess.SP

    Few-shot Human Motion Recognition through Multi-Aspect mmWave FMCW Radar Data

    Authors: Hao Fan, Lingfeng Chen, Chengbai Xu, Jiadong Zhou, Yongpeng Dai, Panhe HU

    Abstract: Radar human motion recognition methods based on deep learning models have been a hot topic in remote sensing in recent years, yet the existing methods are mostly radial-oriented. In practical applications, the test data could be multi-aspect and the number of samples for each motion could be very limited, causing model overfitting and reduced recognition accuracy. This paper proposes channel-DN4, a mul…

    Submitted 19 January, 2025; originally announced January 2025.

  42. arXiv:2501.10811  [pdf, other]

    cs.SD eess.AS

    MusicEval: A Generative Music Dataset with Expert Ratings for Automatic Text-to-Music Evaluation

    Authors: Cheng Liu, Hui Wang, Jinghua Zhao, Shiwan Zhao, Hui Bu, Xin Xu, Jiaming Zhou, Haoqin Sun, Yong Qin

    Abstract: The technology for generating music from textual descriptions has seen rapid advancements. However, evaluating text-to-music (TTM) systems remains a significant challenge, primarily due to the difficulty of balancing performance and cost with existing objective and subjective evaluation methods. In this paper, we propose an automatic assessment task for TTM models to align with human perception. T…

    Submitted 23 March, 2025; v1 submitted 18 January, 2025; originally announced January 2025.

    Comments: Accepted by ICASSP 2025

  43. arXiv:2501.06282  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

    Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, et al. (11 additional authors not shown)

    Abstract: Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence le…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  44. arXiv:2501.02815  [pdf, other]

    cs.RO eess.SY

    Local Reactive Control for Mobile Manipulators with Whole-Body Safety in Complex Environments

    Authors: Chunxin Zheng, Yulin Li, Zhiyuan Song, Zhihai Bi, Jinni Zhou, Boyu Zhou, Jun Ma

    Abstract: Mobile manipulators typically encounter significant challenges in navigating narrow, cluttered environments due to their high-dimensional state spaces and complex kinematics. While reactive methods excel in dynamic settings, they struggle to efficiently incorporate complex, coupled constraints across the entire state space. In this work, we present a novel local reactive controller that reformulat…

    Submitted 6 January, 2025; originally announced January 2025.

  45. arXiv:2412.20821  [pdf, other]

    eess.AS cs.CL cs.SD

    Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment

    Authors: Xuechen Wang, Shiwan Zhao, Haoqin Sun, Hui Wang, Jiaming Zhou, Yong Qin

    Abstract: Multimodal emotion recognition (MER), leveraging speech and text, has emerged as a pivotal domain within human-computer interaction, demanding sophisticated methods for effective multimodal integration. The challenge of aligning features across these modalities is significant, with most existing approaches adopting a singular alignment strategy. Such a narrow focus not only limits model performanc…

    Submitted 30 December, 2024; originally announced December 2024.

    Comments: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  46. arXiv:2412.19099  [pdf, other]

    cs.SD eess.AS

    BSDB-Net: Band-Split Dual-Branch Network with Selective State Spaces Mechanism for Monaural Speech Enhancement

    Authors: Cunhang Fan, Enrui Liu, Andong Li, Jianhua Tao, Jian Zhou, Jiahao Li, Chengshi Zheng, Zhao Lv

    Abstract: Although the complex spectrum-based speech enhancement (SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that lim…

    Submitted 26 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  47. arXiv:2412.18417  [pdf, other]

    eess.IV cs.CV

    Ultra-Low Complexity On-Orbit Compression for Remote Sensing Imagery via Block Modulated Imaging

    Authors: Zhibin Wang, Yanxin Cai, Jiayi Zhou, Yangming Zhang, Tianyu Li, Wei Li, Xun Liu, Guoqing Wang, Yang Yang

    Abstract: The growing field of remote sensing faces a challenge: the ever-increasing size and volume of imagery data are exceeding the storage and transmission capabilities of satellite platforms. Efficient compression of remote sensing imagery is a critical solution to alleviate these burdens on satellites. However, existing compression methods are often too computationally expensive for satellites. With t…

    Submitted 12 April, 2025; v1 submitted 24 December, 2024; originally announced December 2024.

  48. arXiv:2412.17062  [pdf, ps, other]

    cs.IT eess.SP

    Hybrid Beamforming Design for RSMA-enabled Near-Field Integrated Sensing and Communications

    Authors: Jiasi Zhou, Chintha Tellambura, Geoffrey Ye Li

    Abstract: Integrated sensing and communication (ISAC) networks leverage extremely large antenna arrays and high frequencies. This inevitably extends the Rayleigh distance, making near-field (NF) spherical wave propagation dominant. This unlocks numerous spatial degrees of freedom, raising the challenge of optimizing them for communication and sensing tradeoffs. To this end, we propose a rate-splitting multi…

    Submitted 20 April, 2025; v1 submitted 22 December, 2024; originally announced December 2024.

    Comments: 13 pages and 9 figures

  49. arXiv:2412.10117  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Authors: Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou

    Abstract: In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progr…

    Submitted 25 December, 2024; v1 submitted 13 December, 2024; originally announced December 2024.

    Comments: Tech report, work in progress

  50. arXiv:2412.08210  [pdf, other]

    cs.CV eess.IV

    Unicorn: Unified Neural Image Compression with One Number Reconstruction

    Authors: Qi Zheng, Haozhi Wang, Zihao Liu, Jiaming Liu, Peiye Liu, Zhijian Hao, Yanheng Lu, Dimin Niu, Jinjia Zhou, Minge Jing, Yibo Fan

    Abstract: Prevalent lossy image compression schemes can be divided into: 1) explicit image compression (EIC), including traditional standards and neural end-to-end algorithms; 2) implicit image compression (IIC) based on implicit neural representations (INR). The former is encountering impasses of either leveling off bitrate reduction at a cost of tremendous complexity while the latter suffers from excessiv…

    Submitted 11 December, 2024; originally announced December 2024.