Showing 1–50 of 126 results for author: Pan, J

Searching in archive eess.
  1. arXiv:2506.23584  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation

    Authors: Renjie Liang, Zhengkang Fan, Jinqian Pan, Chenkun Sun, Russell Terry, Jie Xu

    Abstract: Generating radiology reports from CT scans remains a complex task due to the nuanced nature of medical imaging and the variability in clinical documentation. In this study, we propose a two-stage framework for generating renal radiology reports from 2D CT slices. First, we extract structured abnormality features using a multi-task learning model trained to identify lesion attributes such as locati…

    Submitted 30 June, 2025; originally announced June 2025.

  2. arXiv:2506.22490  [pdf, ps, other]

    eess.SP cs.LG

    MENGLAN: Multiscale Enhanced Nonparametric Gas Analyzer with Lightweight Architecture and Networks

    Authors: Zhenke Duan, Jiqun Pan, Jiani Tu

    Abstract: Accurate detection of ethylene concentrations in mixed gases is crucial in chemical production for safety and health purposes. Traditional methods are hindered by high cost and complexity, limiting their practical application. This study proposes MENGLAN, a Multiscale Enhanced Nonparametric Gas Analyzer that integrates a dual-stream structure, a Hybrid Multi-Head Attention mechanism, and a Feature…

    Submitted 24 June, 2025; originally announced June 2025.

  3. arXiv:2505.23290  [pdf, other]

    cs.SD cs.CV eess.AS

    Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation

    Authors: Hao Li, Ju Dai, Xin Zhao, Feng Zhou, Junjun Pan, Lei Li

    Abstract: In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip mo…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted to CVPR 2025

  4. arXiv:2505.17528  [pdf, ps, other]

    eess.IV cs.CV

    DECT-based Space-Squeeze Method for Multi-Class Classification of Metastatic Lymph Nodes in Breast Cancer

    Authors: Hai Jiang, Chushan Zheng, Jiawei Pan, Yuanpin Zhou, Qiongting Liu, Xiang Zhang, Jun Shen, Yao Lu

    Abstract: Background: Accurate assessment of metastatic burden in axillary lymph nodes is crucial for guiding breast cancer treatment decisions, yet conventional imaging modalities struggle to differentiate metastatic burden levels and capture comprehensive lymph node characteristics. This study leverages dual-energy computed tomography (DECT) to exploit spectral-spatial information for improved multi-class…

    Submitted 23 May, 2025; originally announced May 2025.

  5. arXiv:2505.17484  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    Anatomy-Guided Multitask Learning for MRI-Based Classification of Placenta Accreta Spectrum and its Subtypes

    Authors: Hai Jiang, Qiongting Liu, Yuanpin Zhou, Jiawei Pan, Ting Song, Yao Lu

    Abstract: Placenta Accreta Spectrum Disorders (PAS) pose significant risks during pregnancy, frequently leading to postpartum hemorrhage during cesarean deliveries and other severe clinical complications, with bleeding severity correlating to the degree of placental invasion. Consequently, accurate prenatal diagnosis of PAS and its subtypes-placenta accreta (PA), placenta increta (PI), and placenta percreta…

    Submitted 23 May, 2025; originally announced May 2025.

  6. arXiv:2505.01263  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

    Authors: Gaoxiang Cong, Liang Li, Jiadong Pan, Zhedong Zhang, Amin Beheshti, Anton van den Hengel, Yuankai Qi, Qingming Huang

    Abstract: Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow…

    Submitted 2 May, 2025; originally announced May 2025.

  7. arXiv:2504.13037  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    Towards Cardiac MRI Foundation Models: Comprehensive Visual-Tabular Representations for Whole-Heart Assessment and Beyond

    Authors: Yundi Zhang, Paul Hager, Che Liu, Suprosanna Shit, Chen Chen, Daniel Rueckert, Jiazhen Pan

    Abstract: Cardiac magnetic resonance imaging is the gold standard for non-invasive cardiac assessment, offering rich spatio-temporal views of the cardiac anatomy and physiology. Patient-level health factors, such as demographics, metabolic, and lifestyle, are known to substantially influence cardiovascular health and disease risk, yet remain uncaptured by CMR alone. To holistically understand cardiac health…

    Submitted 2 July, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

  8. arXiv:2504.12711  [pdf, other]

    cs.CV cs.AI eess.IV

    NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou , et al. (112 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ…

    Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

  9. arXiv:2504.10686  [pdf, other]

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang , et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  10. arXiv:2504.01638  [pdf, other]

    eess.SY

    Convex Computations for Controlled Safety Invariant Sets of Black-box Discrete-time Dynamical Systems

    Authors: Taoran Wu, Yiling Xue, Jingduo Pan, Dejin Ren, Arvind Easwaran, Bai Xue

    Abstract: Identifying controlled safety invariant sets (CSISs) is essential in safety-critical applications. This paper tackles the problem of identifying CSISs for black-box discrete-time systems, where the model is unknown and only limited simulation data is accessible. Traditionally, a CSIS is defined as a subset of a safe set, encompassing initial states for which a control input exists that keeps the s…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: 15 pages

  11. arXiv:2504.01260  [pdf, other]

    cs.RO cs.HC eess.SY

    The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction

    Authors: Roy El-Helou, Matthew K. X. J Pan

    Abstract: This study explores how human perceptions of a non-anthropomorphic robotic manipulator are shaped by two key dimensions of behaviour: arousal, defined as the robot's movement energy and expressiveness, and attention, defined as the robot's capacity to selectively orient toward and engage with a user. We introduce a novel control architecture that integrates a gaze-like attention engine with an aro…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: 7 pages, 3 figures, 1 table

  12. arXiv:2503.08638  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    YuE: Scaling Open Foundation Models for Long-Form Music Generation

    Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang , et al. (32 additional authors not shown)

    Abstract: We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: https://github.com/multimodal-art-projection/YuE

  13. arXiv:2503.01710  [pdf, other]

    cs.SD cs.AI eess.AS

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

    Authors: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

    Abstract: Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a sin…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Submitted to ACL 2025

  14. arXiv:2502.16584  [pdf, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Audio-FLAN: A Preliminary Release

    Authors: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

    Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learnin…

    Submitted 23 February, 2025; originally announced February 2025.

  15. arXiv:2502.06100  [pdf, other]

    cs.CV eess.SP

    Col-OLHTR: A Novel Framework for Multimodal Online Handwritten Text Recognition

    Authors: Chenyu Liu, Jinshui Hu, Baocai Yin, Jia Pan, Bing Yin, Jun Du, Qingfeng Liu

    Abstract: Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications. Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders, combined with a CTC or attention-based recognition decoder. However, these approaches face several drawbacks: 1) single encoders ty…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: ICASSP 2025

  16. arXiv:2501.08667  [pdf, ps, other]

    eess.IV cs.CV

    TimeFlow: Longitudinal Brain Image Registration and Aging Progression Analysis

    Authors: Bailiang Jian, Jiazhen Pan, Yitong Li, Fabian Bongratz, Ruochen Li, Daniel Rueckert, Benedikt Wiestler, Christian Wachinger

    Abstract: Predicting future brain states is crucial for understanding healthy aging and neurodegenerative diseases. Longitudinal brain MRI registration, a cornerstone for such analyses, has long been limited by its inability to forecast future developments, reliance on extensive dense longitudinal data, and the need to balance registration accuracy with temporal smoothness. In this work, we present Ti…

    Submitted 7 July, 2025; v1 submitted 15 January, 2025; originally announced January 2025.

  17. arXiv:2501.06482  [pdf, other]

    eess.SP

    Deep Reinforcement Learning Optimized Intelligent Resource Allocation in Active RIS-Integrated TN-NTN Networks

    Authors: Muhammad Ahmed Mohsin, Hassan Rizwan, Muhammad Jazib, Muhammad Iqbal, Muhammad Bilal, Tabinda Ashraf, Muhammad Farhan Khan, Jen-Yi Pan

    Abstract: This work explores the deployment of active reconfigurable intelligent surfaces (A-RIS) in integrated terrestrial and non-terrestrial networks (TN-NTN) while utilizing coordinated multipoint non-orthogonal multiple access (CoMP-NOMA). Our system model incorporates a UAV-assisted RIS in coordination with a terrestrial RIS which aims for signal enhancement. We aim to maximize the sum rate for all us…

    Submitted 11 January, 2025; originally announced January 2025.

    Comments: Accepted to WCNC 2025

  18. arXiv:2412.08988  [pdf, other]

    cs.SD cs.MM eess.AS

    EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

    Authors: Gaoxiang Cong, Jiadong Pan, Liang Li, Yuankai Qi, Yuxin Peng, Anton van den Hengel, Jian Yang, Qingming Huang

    Abstract: Given a piece of text, a video clip, and a reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. The existing methods have two primary deficiencies: (1) They struggle to simultaneously hold audio-visual sync and achieve clear pronunciation; (2) They lack the capacity to express user-defined emotions. To address these problems, w…

    Submitted 24 April, 2025; v1 submitted 12 December, 2024; originally announced December 2024.

    Comments: Accepted to CVPR 2025

  19. arXiv:2411.08904  [pdf, other]

    eess.SP

    Generalized Scattering Matrix of Antenna: Moment Solution, Compression Storage and Application

    Authors: Chenbo Shi, Jin Pan, Xin Gu, Shichen Liang, Le Zuo

    Abstract: This paper presents a computation method of generalized scattering matrix (GSM) based on integral equations and the method of moments (MoM), specifically designed for antennas excited through waveguide ports. By leveraging two distinct formulations -- magnetic-type and electric-type integral equations -- we establish concise algebraic relations linking the GSM directly to the impedance matrices ob…

    Submitted 23 April, 2025; v1 submitted 29 October, 2024; originally announced November 2024.

  20. arXiv:2411.06667  [pdf, other]

    eess.AS cs.SD

    DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

    Authors: Shu-Tong Niu, Jun Du, Ruo-Yu Wang, Gao-Bin Yang, Tian Gao, Jia Pan, Yu Hu

    Abstract: We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end automatic speech recognition (ASR), combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to leverage speaker time boundaries from the diarization modu…

    Submitted 27 December, 2024; v1 submitted 10 November, 2024; originally announced November 2024.

  21. arXiv:2410.21256  [pdf, other]

    cs.AI cs.CV eess.IV

    Multi-modal AI for comprehensive breast cancer prognostication

    Authors: Jan Witowski, Ken G. Zeng, Joseph Cappadona, Jailan Elayoubi, Khalil Choucair, Elena Diana Chiru, Nancy Chan, Young-Joon Kang, Frederick Howard, Irina Ostrovnaya, Carlos Fernandez-Granda, Freya Schnabel, Zoe Steinsnyder, Ugur Ozerdem, Kangning Liu, Waleed Abdulsattar, Yu Zong, Lina Daoud, Rafic Beydoun, Anas Saad, Nitya Thakore, Mohammad Sadic, Frank Yeung, Elisa Liu, Theodore Hill , et al. (26 additional authors not shown)

    Abstract: Treatment selection in breast cancer is guided by molecular subtypes and clinical characteristics. However, current tools including genomic assays lack the accuracy required for optimal clinical decision-making. We developed a novel artificial intelligence (AI)-based approach that integrates digital pathology images with clinical data, providing a more robust and effective method for predicting th…

    Submitted 2 March, 2025; v1 submitted 28 October, 2024; originally announced October 2024.

  22. arXiv:2409.16803  [pdf, other]

    eess.AS cs.SD

    Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings

    Authors: Ruoyu Wang, Shutong Niu, Gaobin Yang, Jun Du, Shuangqing Qian, Tian Gao, Jia Pan

    Abstract: Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speaker diarization methods have seldom discussed how to leverage spatial cues from multi-channel speech. This paper proposes a three-stage modular system…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: 5 pages, Submitted to ICASSP 2025

  23. arXiv:2409.13913  [pdf, other]

    cs.CL cs.SD eess.AS

    Target word activity detector: An approach to obtain ASR word boundaries without lexicon

    Authors: Sunit Sivasankaran, Eric Sun, Jinyu Li, Yan Huang, Jing Pan

    Abstract: Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. This issue is further complicated in multilingual models. Existing methods, either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimate w…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  24. arXiv:2409.02041  [pdf, other]

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

    Authors: Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao

    Abstract: This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several a…

    Submitted 24 October, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

  25. arXiv:2408.17175  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

    Authors: Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue

    Abstract: Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were or…

    Submitted 27 November, 2024; v1 submitted 30 August, 2024; originally announced August 2024.

  26. arXiv:2407.20962  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

    Authors: Xiaowei Chi, Yatian Wang, Aosong Cheng, Pengjun Fang, Zeyue Tian, Yingqing He, Zhaoyang Liu, Xingqun Qi, Jiahao Pan, Rongyu Zhang, Mengfei Li, Ruibin Yuan, Yanbing Jiang, Wei Xue, Wenhan Luo, Qifeng Chen, Shanghang Zhang, Qifeng Liu, Yike Guo

    Abstract: Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, considering audio to be weakly related information. They usually overlook exploring the potential of inherent audio-visual correlation, leading to monotonous annotation within each modalit…

    Submitted 17 December, 2024; v1 submitted 30 July, 2024; originally announced July 2024.

    Comments: 15 Pages. Dataset report

  27. arXiv:2407.20108  [pdf, other]

    eess.IV cs.AI cs.CV

    Classification, Regression and Segmentation directly from k-Space in Cardiac MRI

    Authors: Ruochen Li, Jiazhen Pan, Youxiang Zhu, Juncheng Ni, Daniel Rueckert

    Abstract: Cardiac Magnetic Resonance Imaging (CMR) is the gold standard for diagnosing cardiovascular diseases. Clinical diagnoses predominantly rely on magnitude-only Digital Imaging and Communications in Medicine (DICOM) images, omitting crucial phase information that might provide additional diagnostic benefits. In contrast, k-space is complex-valued and encompasses both magnitude and phase information,…

    Submitted 29 July, 2024; originally announced July 2024.

  28. Mamba? Catch The Hype Or Rethink What Really Helps for Image Registration

    Authors: Bailiang Jian, Jiazhen Pan, Morteza Ghahremani, Daniel Rueckert, Christian Wachinger, Benedikt Wiestler

    Abstract: Our findings indicate that adopting "advanced" computational elements fails to significantly improve registration accuracy. Instead, well-established registration-specific designs offer fair improvements, enhancing results by a marginal 1.5% over the baseline. Our findings emphasize the importance of rigorous, unbiased evaluation and contribution disentanglement of all low- and high-level registr…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: WBIR 2024 Workshop on Biomedical Imaging Registration

  29. arXiv:2406.16872  [pdf, other]

    eess.SP cs.AI

    Multi-channel Time Series Decomposition Network For Generalizable Sensor-Based Activity Recognition

    Authors: Jianguo Pan, Zhengxin Hu, Lingdun Zhang, Xia Cai

    Abstract: Sensor-based human activity recognition is important in daily scenarios such as smart healthcare and homes due to its non-intrusive privacy and low cost advantages, but the problem of out-of-domain generalization caused by differences in focusing individuals and operating environments can lead to significant accuracy degradation on cross-person behavior recognition due to the inconsistent distribu…

    Submitted 28 March, 2024; originally announced June 2024.

  30. arXiv:2406.02430  [pdf, other]

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu , et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…

    Submitted 4 June, 2024; originally announced June 2024.

  31. Whole Heart 3D+T Representation Learning Through Sparse 2D Cardiac MR Images

    Authors: Yundi Zhang, Chen Chen, Suprosanna Shit, Sophie Starck, Daniel Rueckert, Jiazhen Pan

    Abstract: Cardiac Magnetic Resonance (CMR) imaging serves as the gold-standard for evaluating cardiac morphology and function. Typically, a multi-view CMR stack, covering short-axis (SA) and 2/3/4-chamber long-axis (LA) views, is acquired for a thorough cardiac assessment. However, efficiently streamlining the complex, high-dimensional 3D+T CMR data and distilling compact, coherent representation remains a…

    Submitted 6 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

  32. Direct Cardiac Segmentation from Undersampled K-space Using Transformers

    Authors: Yundi Zhang, Nil Stolt-Ansó, Jiazhen Pan, Wenqi Huang, Kerstin Hammernik, Daniel Rueckert

    Abstract: The prevailing deep learning-based methods of predicting cardiac segmentation involve reconstructed magnetic resonance (MR) images. The heavy dependency of segmentation approaches on image quality significantly limits the acceleration rate in fast MR reconstruction. Moreover, the practice of treating reconstruction and segmentation as separate sequential processes leads to artifact generation and…

    Submitted 31 May, 2024; originally announced June 2024.

  33. arXiv:2405.16952  [pdf, other]

    eess.AS

    A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition

    Authors: Zilu Guo, Qing Wang, Jun Du, Jia Pan, Qing-Feng Liu, Chin-Hui

    Abstract: In this paper, we propose a variance-preserving interpolation framework to improve diffusion models for single-channel speech enhancement (SE) and automatic speech recognition (ASR). This new variance-preserving interpolation diffusion model (VPIDM) approach requires only 25 iterative steps and obviates the need for a corrector, an essential element in the existing variance-exploding interpolation…

    Submitted 27 May, 2024; originally announced May 2024.

  34. arXiv:2404.18081  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG cs.MM eess.AS

    ComposerX: Multi-Agent Symbolic Music Composition with LLMs

    Authors: Qixin Deng, Qikai Yang, Ruibin Yuan, Yipeng Huang, Yi Wang, Xubo Liu, Zeyue Tian, Jiahao Pan, Ge Zhang, Hanfeng Lin, Yizhi Li, Yinghao Ma, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenwu Wang, Guangyu Xia, Wei Xue, Yike Guo

    Abstract: Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and C…

    Submitted 30 April, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

  35. arXiv:2404.17621  [pdf, other]

    eess.IV cs.CV cs.LG

    Attention-aware non-rigid image registration for accelerated MR imaging

    Authors: Aya Ghoul, Jiazhen Pan, Andreas Lingg, Jens Kübler, Patrick Krumm, Kerstin Hammernik, Daniel Rueckert, Sergios Gatidis, Thomas Küstner

    Abstract: Accurate motion estimation at high acceleration factors enables rapid motion-compensated reconstruction in Magnetic Resonance Imaging (MRI) without compromising the diagnostic image quality. In this work, we introduce an attention-aware deep learning-based framework that can perform non-rigid pairwise registration for fully sampled and accelerated MRI. We extract local visual representations to bu…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: 14 pages, 7 figures

  36. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  37. arXiv:2404.14700  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    FlashSpeech: Efficient Zero-Shot Speech Synthesis

    Authors: Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Wei Xue, Qifeng Liu, Yike Guo

    Abstract: Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large…

    Submitted 24 October, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Efficient zero-shot speech synthesis

  38. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  39. arXiv:2404.08857  [pdf, other]

    cs.SD cs.AI eess.AS

    Voice Attribute Editing with Text Prompt

    Authors: Zhengyan Sheng, Yang Ai, Li-Juan Liu, Jia Pan, Zhen-Hua Ling

    Abstract: Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this t…

    Submitted 30 November, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

  40. arXiv:2404.01611  [pdf]

    cs.LG cs.SD eess.AS

    Audio Simulation for Sound Source Localization in Virtual Evironment

    Authors: Yi Di Yuan, Swee Liang Wong, Jonathan Pan

    Abstract: Non-line-of-sight localization in signal-deprived environments is a challenging yet pertinent problem. Acoustic methods in such predominantly indoor scenarios encounter difficulty due to the reverberant nature. In this study, we aim to locate sound sources to specific locations within a virtual environment by leveraging physically grounded sound propagation simulations and machine learning methods…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: 2024 IEEE World Forum on Public Safety Technology

  41. arXiv:2404.00656  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    WavLLM: Towards Robust and Adaptive Speech Large Language Model

    Authors: Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei

    Abstract: The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In th…

    Submitted 21 September, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

    Comments: accepted by EMNLP2024 findings

  42. arXiv:2401.12173  [pdf, other]

    eess.SP

    Waveform-Domain Complementary Signal Sets for Interrupted Sampling Repeater Jamming Suppression

    Authors: Hanning Su, Qinglong Bao, Jiameng Pan, Fucheng Guo, Weidong Hu

    Abstract: The interrupted-sampling repeater jamming (ISRJ) is coherent and has the characteristic of suppression and deception to degrade the radar detection capabilities. The study focuses on anti-ISRJ techniques in the waveform domain, primarily capitalizing on waveform design and anti-jamming signal processing methods in the waveform domain. By exploring the relationship between waveform-domain adapt…

    Submitted 18 January, 2024; originally announced January 2024.

  43. arXiv:2311.15309  [pdf, other]

    eess.IV

    Deep Refinement-Based Joint Source Channel Coding over Time-Varying Channels

    Authors: Junyu Pan, Hanlei Li, Guangyi Zhang, Yunlong Cai, Guanding Yu

    Abstract: In recent developments, deep learning (DL)-based joint source-channel coding (JSCC) for wireless image transmission has made significant strides in performance enhancement. Nonetheless, the majority of existing DL-based JSCC methods are tailored for scenarios featuring stable channel conditions, notably a fixed signal-to-noise ratio (SNR). This specialization poses a limitation, as their performan…

    Submitted 26 November, 2023; originally announced November 2023.

  44. arXiv:2311.02248  [pdf, other]

    cs.CL cs.AI eess.AS

    COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning

    Authors: Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li

    Abstract: We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters…

    Submitted 14 June, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

  45. arXiv:2309.16206  [pdf, other]

    eess.IV cs.CV

    Alzheimer's Disease Prediction via Brain Structural-Functional Deep Fusing Network

    Authors: Qiankun Zuo, Junren Pan, Shuqiang Wang

    Abstract: Fusing structural-functional images of the brain has shown great potential to analyze the deterioration of Alzheimer's disease (AD). However, it is a big challenge to effectively fuse the correlated and complementary information from multimodal neuroimages. In this paper, a novel model termed cross-modal transformer generative adversarial network (CT-GAN) is proposed to effectively fuse the functi…

    Submitted 5 October, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: 10 pages

  46. arXiv:2309.14087  [pdf, other]

    eess.SP

    Adaptive Three Layer Hybrid Reconfigurable Intelligent Surface for 6G Wireless Communication: Trade-offs and Performance

    Authors: Rashed Hasan Ratul, Muhammad Iqbal, Tabinda Ashraf, Jen-Yi Pan, Yi-Han Wang, Shao-Yu Lien

    Abstract: A potential candidate technology for the development of future 6G networks has been recognized as Reconfigurable Intelligent Surface (RIS). However, due to the variation in radio link quality, traditional passive RISs only accomplish a minimal signal gain in situations with strong direct links between user equipment (UE) and base station (BS). In order to get over this fundamental restriction of s…

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted for presentation and publication at the 8th IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob)

  47. arXiv:2309.10832  [pdf, ps, other]

    cs.SD eess.AS

    Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding

    Authors: Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang

    Abstract: Multi-channel speech enhancement extracts speech using multiple microphones that capture spatial cues. Effectively utilizing directional information is key for multi-channel enhancement. Deep learning shows great potential on multi-channel speech enhancement and often takes short-time Fourier Transform (STFT) as inputs directly. To fully leverage the spatial information, we introduce a method usin…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: arXiv admin note: text overlap with arXiv:2309.10393

  48. arXiv:2309.10393  [pdf, ps, other]

    cs.SD eess.AS

    Hierarchical Modeling of Spatial Cues via Spherical Harmonics for Multi-Channel Speech Enhancement

    Authors: Jiahui Pan, Shulin He, Hui Zhang, Xueliang Zhang

    Abstract: Multi-channel speech enhancement utilizes spatial information from multiple microphones to extract the target speech. However, most existing methods do not explicitly model spatial cues, instead relying on implicit learning from multi-channel spectra. To better leverage spatial information, we propose explicitly incorporating spatial modeling by applying spherical harmonic transforms (SHT) to the…

    Submitted 19 September, 2023; originally announced September 2023.

  49. arXiv:2309.10379  [pdf, ps, other]

    cs.SD eess.AS

    PDPCRN: Parallel Dual-Path CRN with Bi-directional Inter-Branch Interactions for Multi-Channel Speech Enhancement

    Authors: Jiahui Pan, Shulin He, Tianci Wu, Hui Zhang, Xueliang Zhang

    Abstract: Multi-channel speech enhancement seeks to utilize spatial information to distinguish target speech from interfering signals. While deep learning approaches like the dual-path convolutional recurrent network (DPCRN) have made strides, challenges persist in effectively modeling inter-channel correlations and amalgamating multi-level information. In response, we introduce the Parallel Dual-Path Convo…

    Submitted 19 September, 2023; originally announced September 2023.

  50. arXiv:2309.08643  [pdf, other]

    eess.IV

    NISF: Neural Implicit Segmentation Functions

    Authors: Nil Stolt-Ansó, Julian McGinnis, Jiazhen Pan, Kerstin Hammernik, Daniel Rueckert

    Abstract: Segmentation of anatomical shapes from medical images has taken an important role in the automation of clinical measurements. While typical deep-learning segmentation approaches are performed on discrete voxels, the underlying objects being analysed exist in a real-valued continuous space. Approaches that rely on convolutional neural networks (CNNs) are limited to grid-like inputs and not easily a…

    Submitted 14 September, 2023; originally announced September 2023.