Showing 1–20 of 20 results for author: Shen, F

Searching in archive eess.
  1. arXiv:2504.12796  [pdf, other]

    cs.MM cs.SD eess.AS

    A Survey on Cross-Modal Interaction Between Music and Multimodal Data

    Authors: Sifei Li, Mining Tan, Feier Shen, Minyan Luo, Zijiao Yin, Fan Tang, Weiming Dong, Changsheng Xu

    Abstract: Multimodal learning has driven innovation across various industries, particularly in the field of music. By enabling more intuitive interaction experiences and enhancing immersion, it not only lowers the entry barriers to the music but also increases its overall appeal. This survey aims to provide a comprehensive review of multimodal tasks related to music, outlining how music contributes to multi…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: 34 pages, 7 figures

  2. arXiv:2503.20499  [pdf, other]

    cs.SD eess.AS

    FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System

    Authors: Hao-Han Guo, Yao Hu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, Kun Xie

    Abstract: In this work, we upgrade FireRedTTS to a new version, FireRedTTS-1S, a high-quality streaming foundation text-to-speech system. FireRedTTS-1S achieves streaming speech generation via two steps: text-to-semantic decoding and semantic-to-acoustic decoding. In text-to-semantic decoding, a semantic-aware speech tokenizer converts the speech signal into semantic tokens, which can be synthesized from th…

    Submitted 26 May, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

  3. arXiv:2502.19906  [pdf, other]

    eess.AS cs.SD

    PrimeK-Net: Multi-scale Spectral Learning via Group Prime-Kernel Convolutional Neural Networks for Single Channel Speech Enhancement

    Authors: Zizhen Lin, Junyu Wang, Ruili Li, Fei Shen, Xi Xuan

    Abstract: Single-channel speech enhancement is a challenging ill-posed problem focused on estimating clean speech from degraded signals. Existing studies have demonstrated the competitive performance of combining convolutional neural networks (CNNs) with Transformers in speech enhancement tasks. However, existing frameworks have not sufficiently addressed computational efficiency and have overlooked the nat…

    Submitted 27 February, 2025; originally announced February 2025.

    Comments: This paper was accepted by ICASSP 2025

  4. arXiv:2502.17213  [pdf, other]

    q-bio.NC cs.AI cs.LG eess.SP

    Deep Learning-Powered Electrical Brain Signals Analysis: Advancing Neurological Diagnostics

    Authors: Jiahe Li, Xin Chen, Fanqi Shen, Junru Chen, Yuxin Liu, Daoze Zhang, Zhizhang Yuan, Fang Zhao, Meng Li, Yang Yang

    Abstract: Neurological disorders represent significant global health challenges, driving the advancement of brain signal analysis methods. Scalp electroencephalography (EEG) and intracranial electroencephalography (iEEG) are widely used to diagnose and monitor neurological conditions. However, dataset heterogeneity and task variations pose challenges in developing robust deep learning solutions. This review…

    Submitted 24 February, 2025; originally announced February 2025.

  5. arXiv:2502.11946  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, et al. (120 additional authors not shown)

    Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contribu…

    Submitted 18 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

  6. arXiv:2409.03283  [pdf, other]

    cs.SD eess.AS

    FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications

    Authors: Hao-Han Guo, Yao Hu, Kun Liu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, Kun Xie, Kai-Tuo Xu

    Abstract: This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale high-quality TTS data…

    Submitted 11 April, 2025; v1 submitted 5 September, 2024; originally announced September 2024.

  7. arXiv:2408.07264  [pdf]

    eess.IV cs.CV

    Lesion-aware network for diabetic retinopathy diagnosis

    Authors: Xue Xia, Kun Zhan, Yuming Fang, Wenhui Jiang, Fei Shen

    Abstract: Deep learning brought boosts to auto diabetic retinopathy (DR) diagnosis, thus, greatly helping ophthalmologists for early disease detection, which contributes to preventing disease deterioration that may eventually lead to blindness. It has been proved that convolutional neural network (CNN)-aided lesion identifying or segmentation benefits auto DR screening. The key to fine-grained lesion tasks…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: This is the submitted version without improvements by reviewers. The final version is published in the International Journal of Imaging Systems and Technology (https://onlinelibrary.wiley.com/doi/10.1002/ima.22933)

  8. arXiv:2407.03892  [pdf, other]

    cs.SD cs.AI eess.AS

    On the Effectiveness of Acoustic BPE in Decoder-Only TTS

    Authors: Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

    Abstract: Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. But…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: 5 pages, 3 tables, 1 figure. Accepted to Interspeech 2024

    Journal ref: https://www.isca-archive.org/interspeech_2024/li24qa_interspeech.pdf

  9. Low-Complexity Estimation Algorithm and Decoupling Scheme for FRaC System

    Authors: Mengjiang Sun, Peng Chen, Zhenxin Cao, Fei Shen

    Abstract: With the leaping advances in autonomous vehicles and transportation infrastructure, dual function radar-communication (DFRC) systems have become attractive due to the size, cost and resource efficiency. A frequency modulated continuous waveform (FMCW)-based radar-communication system (FRaC) utilizing both sparse multiple-input and multiple-output (MIMO) arrays and index modulation (IM) has been pr…

    Submitted 27 March, 2024; originally announced March 2024.

    Journal ref: IEEE Transactions on Intelligent Vehicles, 2024

  10. arXiv:2402.10251  [pdf, other]

    q-bio.NC cs.AI cs.LG eess.SP

    BrainWave: A Brain Signal Foundation Model for Clinical Applications

    Authors: Zhizhang Yuan, Fanqi Shen, Meng Li, Yuguo Yu, Chenhao Tan, Yang Yang

    Abstract: Neural electrical activity is fundamental to brain function, underlying a range of cognitive and behavioral processes, including movement, perception, decision-making, and consciousness. Abnormal patterns of neural signaling often indicate the presence of underlying brain diseases. The variability among individuals, the diverse array of clinical symptoms from various brain disorders, and the limit…

    Submitted 19 September, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

    Comments: 39 pages, 14 figures

  11. arXiv:2311.15583  [pdf, other]

    cs.LG eess.SP

    A Simple Geometric-Aware Indoor Positioning Interpolation Algorithm Based on Manifold Learning

    Authors: Suorong Yang, Geng Zhang, Jian Zhao, Furao Shen

    Abstract: Interpolation methodologies have been widely used within the domain of indoor positioning systems. However, existing indoor positioning interpolation algorithms exhibit several inherent limitations, including reliance on complex mathematical models, limited flexibility, and relatively low precision. To enhance the accuracy and efficiency of indoor positioning interpolation techniques, this paper p…

    Submitted 27 November, 2023; originally announced November 2023.

  12. arXiv:2310.14580  [pdf, other]

    cs.SD eess.AS

    Acoustic BPE for Speech Generation with Discrete Tokens

    Authors: Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

    Abstract: Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling proces…

    Submitted 15 January, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: 5 pages, 2 figures; accepted to ICASSP 2024

  13. arXiv:2309.07377  [pdf, other]

    eess.AS cs.SD

    Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

    Authors: Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen

    Abstract: Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speec…

    Submitted 14 December, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted in ICASSP 2024

  14. arXiv:2309.04182  [pdf, other]

    cs.SD cs.IR eess.AS

    A Long-Tail Friendly Representation Framework for Artist and Music Similarity

    Authors: Haoran Xiang, Junyu Dai, Xuchen Song, Furao Shen

    Abstract: The investigation of the similarity between artists and music is crucial in music retrieval and recommendation, and addressing the challenge of the long-tail phenomenon is increasingly important. This paper proposes a Long-Tail Friendly Representation Framework (LTFRF) that utilizes neural networks to model the similarity relationship. Our approach integrates music, user, metadata, and relationshi…

    Submitted 8 September, 2023; originally announced September 2023.

  15. UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

    Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu

    Abstract: The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted…

    Submitted 28 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: Accepted to AAAI 2024

  16. Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

    Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu

    Abstract: In this paper, we describe the systems developed by the SJTU X-LANCE team for LIMMITS 2023 Challenge, and we mainly focus on the winning system on naturalness for track 1. The aim of this challenge is to build a multi-speaker multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each of the languages has a male and a female speaker in the given dataset. In track 1, only 5 hours…

    Submitted 8 November, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted to ICASSP 2023 Special Session for Grand Challenges

  17. arXiv:2101.07429  [pdf, other]

    eess.IV cs.CV

    Learning Efficient, Explainable and Discriminative Representations for Pulmonary Nodules Classification

    Authors: Hanliang Jiang, Fuhao Shen, Fei Gao, Weidong Han

    Abstract: Automatic pulmonary nodules classification is significant for early diagnosis of lung cancers. Recently, deep learning techniques have enabled remarkable progress in this field. However, these deep models are typically of high computational complexity and work in a black-box manner. To combat these challenges, in this work, we aim to build an efficient and (partially) explainable classification mo…

    Submitted 18 January, 2021; originally announced January 2021.

    Journal ref: Pattern Recognition, 2021

  18. arXiv:2008.02486  [pdf, other]

    eess.SP

    3D Spectrum Mapping Based on ROI-Driven UAV Deployment

    Authors: Qihui Wu, Feng Shen, Zheng Wang, Guoru Ding

    Abstract: Given the explosive growth of Internet of Things (IoT) devices ranging from the two-dimensional (2D) ground to the three-dimensional (3D) space, it is a necessity to establish a 3D spectrum map to comprehensively present and effectively manage the 3D spatial spectrum resources in smart city infrastructures. By leveraging the popularity and location flexibility of the unmanned aerial vehicles (UAVs…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 17 pages, 5 figures. Accepted to appear in IEEE Network

  19. A Global Solution Method for Decentralized Multi-Area SCUC and Savings Allocation Based on MILP Value Functions

    Authors: Xiaodong Zheng, Haoyong Chen, Yan Xu, Feifan Shen, Zipeng Liang

    Abstract: To address the issue that Lagrangian dual function based algorithms cannot guarantee convergence and global optimality for decentralized multi-area security constrained unit commitment (M-SCUC) problems, a novel decomposition and coordination method using MILP (mixed integer linear programming) value functions is proposed in this paper. Each regional system operator sets the tie-line power injecti…

    Submitted 17 June, 2019; originally announced June 2019.

    Comments: 8 pages, 7 figures

    Journal ref: IET Generation, Transmission & Distribution, vol. 14, no. 16, pp. 3230-3240, 21 August 2020

  20. arXiv:1809.10871  [pdf, other]

    eess.SP

    Understanding the Temporal Fading in Wireless Industrial Networks: Measurements and Analyses

    Authors: Qilong Zhang, Qiwei Zhang, Wuxiong Zhang, Fei Shen, Tian Hong Loh, Fei Qin

    Abstract: The wide deployment of wireless industrial networks still faces the challenge of unreliable service due to severe multipath fading in industrial environments. Such fading effects are not only caused by the massive metal surfaces existing within the industrial environment but also, more significantly, the moving objects including operators and logistical vehicles. As a result, the mature analytical…

    Submitted 28 September, 2018; originally announced September 2018.