Audio and Speech Processing

Authors and titles for April 2025

Total of 149 entries : 1-100 101-149

[1] arXiv:2504.00621 [pdf, html, other]: Title: How Cyclic Acoustic Patterns Influence ASMR Perception: A Signal Processing Perspective

Zexin Fang, Bin Han, Henrik H. Sveen, C. Clark Cao, Hans D. Schotten

Comments: Submitted to IEEE Transactions on Cognitive and Developmental Systems

Subjects: Audio and Speech Processing (eess.AS)
[2] arXiv:2504.00742 [pdf, other]: Title: Expanding and Analyzing ODAQ -- the Open Dataset of Audio Quality

Sascha Dick, Christoph Thompson, Chih-Wei Wu, Matteo Torcoli, Pablo Delgado, Phillip A. Williams, Emanuel Habets

Comments: Accepted for presentation at the Audio Engineering Society (AES) 157th Convention, October 2024, New York, USA

Subjects: Audio and Speech Processing (eess.AS)
[3] arXiv:2504.01392 [pdf, html, other]: Title: Spatial-Filter-Bank-Based Neural Method for Multichannel Speech Enhancement

Tianqin Zheng, Jilu Jin, Hanchen Pei, Gongping Huang, Jingdong Chen, Jacob Benesty

Subjects: Audio and Speech Processing (eess.AS)
[4] arXiv:2504.01767 [pdf, html, other]: Title: Leveraging Embedding Techniques in Multimodal Machine Learning for Mental Illness Assessment

Abdelrahaman A. Hassan, Abdelrahman A. Ali, Aya E. Fouda, Radwa J. Hanafy, Mohammed E. Fouda

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[5] arXiv:2504.03329 [pdf, html, other]: Title: Mind the Prompt: Prompting Strategies in Audio Generations for Improving Sound Classification

Francesca Ronchini, Ho-Hsiang Wu, Wei-Cheng Lin, Fabio Antonacci

Comments: Accepted at Generative Data Augmentation for Real-World Signal Processing Applications Workshop

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
[6] arXiv:2504.04075 [pdf, html, other]: Title: Real-Time Auralization for First-Person Vocal Interaction in Immersive Virtual Environments

Mauricio Flores-Vargas, Enda Bates, Rachel McDonnell

Subjects: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
[7] arXiv:2504.04450 [pdf, html, other]: Title: WaveNet-Volterra Neural Networks for Active Noise Control: A Fully Causal Approach

Lu Bai, Mengtong Li, Siyuan Lian, Kai Chen, Jing Lu

Subjects: Audio and Speech Processing (eess.AS)
[8] arXiv:2504.04512 [pdf, html, other]: Title: Trainable Adaptive Score Normalization for Automatic Speaker Verification

Jeong-Hwan Choi, Ju-Seok Seong, Ye-Rin Jeoung, Joon-Hyuk Chang

Comments: Accepted at ICASSP'25

Subjects: Audio and Speech Processing (eess.AS)
[9] arXiv:2504.04721 [pdf, html, other]: Title: Bridging the Gap between Continuous and Informative Discrete Representations by Random Product Quantization

Xueqing Li, Zehan Li, Boyu Zhu, Ruihao Jing, Jian Kang, Jie Li, Xiao-Lei Zhang, Xuelong Li

Subjects: Audio and Speech Processing (eess.AS)
[10] arXiv:2504.04751 [pdf, html, other]: Title: Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches

Eloi Moliner, Michal Švento, Alec Wright, Lauri Juvela, Pavel Rajmic, Vesa Välimäki

Comments: Submitted to the 28th International Conference on Digital Audio Effects (DAFx25)

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[11] arXiv:2504.05657 [pdf, html, other]: Title: Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

Comments: This manuscript has been submitted for peer review

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[12] arXiv:2504.06963 [pdf, html, other]: Title: RNN-Transducer-based Losses for Speech Recognition on Noisy Targets

Vladimir Bataev

Comments: Final Project Report, Bachelor's Degree in Computer Science, University of London, March 2024

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[13] arXiv:2504.07652 [pdf, html, other]: Title: Categorical Unsupervised Variational Acoustic Clustering

Luan Vinícius Fiorio, Ivana Nikoloska, Ronald M. Aarts

Subjects: Audio and Speech Processing (eess.AS)
[14] arXiv:2504.08524 [pdf, html, other]: Title: USM-VC: Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion

Na Li, Chuke Wang, Yu Gu, Zhifeng Li

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[15] arXiv:2504.08624 [pdf, html, other]: Title: TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration

Matteo Spanio, Antonio Rodà

Comments: Submitted to DAFx 2025

Subjects: Audio and Speech Processing (eess.AS); Performance (cs.PF); Sound (cs.SD); Signal Processing (eess.SP)
[16] arXiv:2504.08644 [pdf, html, other]: Title: Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation

Davide Berghi, Philip J. B. Jackson

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[17] arXiv:2504.08997 [pdf, other]: Title: Beyond Global Metrics: A Fairness Analysis for Interpretable Voice Disorder Detection Systems

Mariel Estevez, Cyntia Bonomi, Dayana Ribas, Alfonso Ortega, Luciana Ferrer

Comments: 34 pages, 6 figures, 2 tables

Subjects: Audio and Speech Processing (eess.AS)
[18] arXiv:2504.09081 [pdf, other]: Title: SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, Arunasish Sen, Jian Xie, Grant P. Strimel, Andreas Schwarz

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[19] arXiv:2504.09381 [pdf, html, other]: Title: DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers

Heitor R. Guimarães, Jiaqi Su, Rithesh Kumar, Tiago H. Falk, Zeyu Jin

Comments: Manuscript under review

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[20] arXiv:2504.10352 [pdf, html, other]: Title: Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen

Comments: Accepted in ACMMM 2025

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
[21] arXiv:2504.11246 [pdf, html, other]: Title: Respiratory Inhaler Sound Event Classification Using Self-Supervised Learning

Davoud Shariat Panah, Alessandro N Franciosi, Cormac McCarthy, Andrew Hines

Comments: Accepted at the IEEE EMBC 2025 Conference

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[22] arXiv:2504.12423 [pdf, html, other]: Title: Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios

Haohan Shi, Xiyu Shi, Safak Dogan, Saif Alzubi, Tianjin Huang, Yunxiao Zhang

Comments: Accepted by EUSIPCO 2025

Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[23] arXiv:2504.12670 [pdf, html, other]: Title: Temporal Attention Pooling for Frequency Dynamic Convolution in Sound Event Detection

Hyeonuk Nam, Yong-Hwa Park

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[24] arXiv:2504.12867 [pdf, html, other]: Title: EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen

Comments: Accepted at ACMMM 2025

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[25] arXiv:2504.12870 [pdf, html, other]: Title: CST-former: Multidimensional Attention-based Transformer for Sound Event Localization and Detection in Real Scenes

Yusun Shul, Dayun Choi, Jung-Woo Choi

Comments: 12 pages, 10 figures, Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

Subjects: Audio and Speech Processing (eess.AS)
[26] arXiv:2504.13765 [pdf, other]: Title: Modeling L1 Influence on L2 Pronunciation: An MFCC-Based Framework for Explainable Machine Learning and Pedagogical Feedback

Peyman Jahanbin

Comments: 27 pages (including references), 4 figures, 1 table. Combines statistical inference and explainable machine learning to model L1 influence in L2 pronunciation using MFCC features. Methodology and code are openly available via Zenodo and OSF: Zenodo: this https URL OSF: this https URL

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[27] arXiv:2504.14183 [pdf, html, other]: Title: The First VoicePrivacy Attacker Challenge

Natalia Tomashenko, Xiaoxiao Miao, Emmanuel Vincent, Junichi Yamagishi

Comments: Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Journal-ref: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-2

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
[28] arXiv:2504.14409 [pdf, html, other]: Title: Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training

Christopher Ick, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Jonathan Le Roux

Comments: Presented at ICASSP 2025 GenDA Workshop

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[29] arXiv:2504.14437 [pdf, html, other]: Title: Predicting speech intelligibility in older adults for speech enhancement using the Gammachirp Envelope Similarity Index, GESI

Ayako Yamamoto, Fuki Miyazaki, Toshio Irino

Comments: This is a revised manuscript that was submitted to Speech Communication on August 15, 2025

Subjects: Audio and Speech Processing (eess.AS)
[30] arXiv:2504.14817 [pdf, other]: Title: DNN based HRIRs Identification with a Continuously Rotating Speaker Array

Byeong-Yun Ko, Deokki Min, Hyeonuk Nam, Yong-Hwa Park

Subjects: Audio and Speech Processing (eess.AS)
[31] arXiv:2504.14843 [pdf, html, other]: Title: Quantitative Measures for Passive Sonar Texture Analysis

Jarin Ritu, Alexandra Van Dine, Joshua Peeples

Comments: 5 pages, 2 figures

Subjects: Audio and Speech Processing (eess.AS)
[32] arXiv:2504.14906 [pdf, html, other]: Title: OmniAudio: Generating Spatial Audio from 360-Degree Video

Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue

Comments: ICML 2025

Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
[33] arXiv:2504.14915 [pdf, html, other]: Title: StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models

Yeona Hong, Hyewon Han, Woo-jin Chung, Hong-Goo Kang

Comments: Accepted at ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[34] arXiv:2504.14981 [pdf, html, other]: Title: On feature representations for marmoset vocal communication analysis

Eklavya Sarkar, Kaja Wierucka, Alexandra B. Bosshard, Judith Burkart, Mathew Magimai.-Doss

Journal-ref: Bioacoustics Journal (2025) 1-15

Subjects: Audio and Speech Processing (eess.AS)
[35] arXiv:2504.15575 [pdf, html, other]: Title: Exploring the User Experience of AI-Assisted Sound Searching Systems for Creative Workflows

Haohe Liu, Thomas Deacon, Wenwu Wang, Matt Paradis, Mark D. Plumbley

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[36] arXiv:2504.15663 [pdf, html, other]: Title: FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning

Ju Yeon Kang, Ji Won Yoon, Semin Kim, Min Hyun Han, Nam Soo Kim

Comments: Accepted at ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[37] arXiv:2504.16223 [pdf, html, other]: Title: Perceptual Audio Coding: A 40-Year Historical Perspective

Jürgen Herre, Schuyler Quackenbush, Minje Kim, Jan Skoglund

Journal-ref: Published in the Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025

Subjects: Audio and Speech Processing (eess.AS)
[38] arXiv:2504.16289 [pdf, html, other]: Title: Deep, data-driven modeling of room acoustics: literature review and research perspectives

Toon van Waterschoot

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[39] arXiv:2504.16441 [pdf, html, other]: Title: SoCov: Semi-Orthogonal Parametric Pooling of Covariance Matrix for Speaker Recognition

Rongjin Li, Weibin Zhang, Dongpeng Chen, Jintao Kang, Xiaofen Xing

Comments: This paper has been accepted by IEEE ICASSP2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[40] arXiv:2504.17440 [pdf, html, other]: Title: Generating Localized Audible Zones Using a Single-Channel Parametric Loudspeaker

Tao Zhuang, Shaozhe Li, Feng Niu, Jia-Xin Zhong, Jing Lu

Subjects: Audio and Speech Processing (eess.AS)
[41] arXiv:2504.18004 [pdf, html, other]: Title: Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis

Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada

Comments: 4 pages, 1 figure, and 4 tables. Accepted by IEEE EMBC 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[42] arXiv:2504.18157 [pdf, html, other]: Title: DOSE : Drum One-Shot Extraction from Music Mixture

Suntae Hwang, Seonghyeon Kang, Kyungsu Kim, Semin Ahn, Kyogu Lee

Comments: Published in IEEE ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[43] arXiv:2504.18425 [pdf, html, other]: Title: Kimi-Audio Technical Report

KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yuefeng Wu, Yuxin Wu, Dongchao Yang, Hao Yang, Ying Yang, Zhilin Yang, Aoxiong Yin, Ruibin Yuan, Yutong Zhang, Zaida Zhou

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
[44] arXiv:2504.18502 [pdf, other]: Title: Music Tempo Estimation on Solo Instrumental Performance

Zhanhong He, Roberto Togneri, Xiangyu Zhang

Comments: 4 pages, rejected paper by WASPAA2023

Subjects: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR)
[45] arXiv:2504.18539 [pdf, html, other]: Title: Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation

Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun

Comments: ICLR 2025; 22 pages, 6 figures, 14 tables

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
[46] arXiv:2504.19046 [pdf, html, other]: Title: Enhancing Cochlear Implant Signal Coding with Scaled Dot-Product Attention

Billel Essaid, Hamza Kheddar, Noureddine Batel

Journal-ref: 2024 International Conference on Telecommunications and Intelligent Systems (ICTIS)

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[47] arXiv:2504.19062 [pdf, html, other]: Title: Versatile Framework for Song Generation with Prompt-based Control

Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao

Comments: Accepted by Findings of EMNLP 2025

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[48] arXiv:2504.19605 [pdf, html, other]: Title: A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models

Kohei Saijo, Tetsuji Ogawa

Comments: 5 pages, 3 tables, 2 figures. Accepted to EUSIPCO2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[49] arXiv:2504.20334 [pdf, html, other]: Title: Towards Flow-Matching-based TTS without Classifier-Free Guidance

Yuzhe Liang, Wenzhe Liu, Chunyu Qiang, Zhikang Niu, Yushen Chen, Ziyang Ma, Wenxi Chen, Nan Li, Chen Zhang, Xie Chen

Subjects: Audio and Speech Processing (eess.AS)
[50] arXiv:2504.20630 [pdf, html, other]: Title: ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao

Comments: Accepted by ACM Multimedia 2025

Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM); Sound (cs.SD)
[51] arXiv:2504.21528 [pdf, html, other]: Title: Impairments are Clustered in Latents of Deep Neural Network-based Speech Quality Models

Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee

Subjects: Audio and Speech Processing (eess.AS)
[52] arXiv:2504.21815 [pdf, html, other]: Title: From Aesthetics to Human Preferences: Comparative Perspectives of Evaluating Text-to-Music Systems

Huan Zhang, Jinhua Liang, Huy Phan, Wenwu Wang, Emmanouil Benetos

Subjects: Audio and Speech Processing (eess.AS)
[53] arXiv:2504.00750 (cross-list from cs.SD) [pdf, html, other]: Title: $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction

Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li

Comments: Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[54] arXiv:2504.01094 (cross-list from cs.SD) [pdf, html, other]: Title: Multilingual and Multi-Accent Jailbreaking of Audio LLMs

Jaechul Roh, Virat Shejwalkar, Amir Houmansadr

Comments: 21 pages, 6 figures, 15 tables

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
[55] arXiv:2504.01297 (cross-list from cs.RO) [pdf, html, other]: Title: AIM: Acoustic Inertial Measurement for Indoor Drone Localization and Tracking

Yimiao Sun, Weiguo Wang, Luca Mottola, Ruijin Wang, Yuan He

Comments: arXiv admin note: substantial text overlap with arXiv:2504.00445

Subjects: Robotics (cs.RO); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[56] arXiv:2504.01519 (cross-list from cs.CL) [pdf, html, other]: Title: Chain of Correction for Full-text Speech Recognition with Large Language Models

Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang

Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[57] arXiv:2504.01690 (cross-list from cs.SD) [pdf, html, other]: Title: Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance

Taehan Lee, Hyukjun Lee

Comments: Accepted at the 28th European Conference on Artificial Intelligence (ECAI 2025). Source code is available at this https URL

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[58] arXiv:2504.02061 (cross-list from cs.CV) [pdf, html, other]: Title: Aligned Better, Listen Better for Audio-Visual Large Language Models

Yuxin Guo, Shuailei Ma, Shijie Ma, Xiaoyi Bao, Chen-Wei Xie, Kecheng Zheng, Tingyu Weng, Siyang Sun, Yun Zheng, Wei Zou

Comments: Accepted to ICLR 2025

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[59] arXiv:2504.02302 (cross-list from cs.SD) [pdf, html, other]: Title: Causal Self-supervised Pretrained Frontend with Predictive Code for Speech Separation

Wupeng Wang, Zexu Pan, Xinke Li, Shuai Wang, Haizhou Li

Comments: arXiv admin note: text overlap with arXiv:2411.03085

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[60] arXiv:2504.02386 (cross-list from cs.CV) [pdf, html, other]: Title: VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, David Harwath

Comments: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[61] arXiv:2504.02398 (cross-list from cs.CL) [pdf, html, other]: Title: Scaling Analysis of Interleaved Speech-Text Language Models

Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi

Comments: Accepted at COLM 2025

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[62] arXiv:2504.02402 (cross-list from cs.SD) [pdf, html, other]: Title: EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling

Hao Yin, Shi Guo, Xu Jia, Xudong XU, Lu Zhang, Si Liu, Dong Wang, Huchuan Lu, Tianfan Xue

Comments: Our project page: this https URL

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[63] arXiv:2504.02407 (cross-list from cs.SD) [pdf, html, other]: Title: F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization

Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, Baoxun Wang

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[64] arXiv:2504.02586 (cross-list from cs.SD) [pdf, other]: Title: Deep learning for music generation. Four approaches and their comparative evaluation

Razvan Paroiu, Stefan Trausan-Matu

Journal-ref: U.P.B. Scientific Bulletin, Series C, Vol. 85, Issue 4, 2023

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[65] arXiv:2504.02604 (cross-list from cs.CL) [pdf, html, other]: Title: LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect

Hedi Naouara, Jean-Pierre Lorré, Jérôme Louradour

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[66] arXiv:2504.02988 (cross-list from cs.SD) [pdf, html, other]: Title: Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection

Adrian S. Roman, Aiden Chang, Gerardo Meza, Iran R. Roman

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[67] arXiv:2504.03289 (cross-list from cs.SD) [pdf, html, other]: Title: RWKVTTS: Yet another TTS based on RWKV-7

Lin yueyu, Liu Xiao

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[68] arXiv:2504.03373 (cross-list from cs.SD) [pdf, html, other]: Title: An Efficient GPU-based Implementation for Noise Robust Sound Source Localization

Zirui Lin, Masayuki Takigahira, Naoya Terakado, Haris Gulzar, Monikka Roslianna Busto, Takeharu Eda, Katsutoshi Itoyama, Kazuhiro Nakadai, Hideharu Amano

Comments: 6 pages, 2 figures

Subjects: Sound (cs.SD); Robotics (cs.RO); Audio and Speech Processing (eess.AS)
[69] arXiv:2504.03546 (cross-list from cs.CL) [pdf, html, other]: Title: MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang

Comments: Preprint, 122 pages

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[70] arXiv:2504.03679 (cross-list from eess.SP) [pdf, other]: Title: Continuous Boostlet Transform and Associated Uncertainty Principles

Owais Ahmad, Jasifa Fayaz

Comments: 28 pages, 6 figures

Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS); Functional Analysis (math.FA)
[71] arXiv:2504.03998 (cross-list from cs.SD) [pdf, html, other]: Title: Determined blind source separation via modeling adjacent frequency band correlations in speech signals

Jianyu Wang, Shanzheng Guan, Zhengqiao Zhao, Nicolas Dobigeon, Jingdong Chen

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[72] arXiv:2504.04060 (cross-list from cs.CL) [pdf, html, other]: Title: VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[73] arXiv:2504.04589 (cross-list from cs.SD) [pdf, html, other]: Title: Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling

Yicheng Gu, Runsong Zhang, Lauri Juvela, Zhizheng Wu

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[74] arXiv:2504.05009 (cross-list from cs.SD) [pdf, html, other]: Title: Deconstructing Jazz Piano Style Using Machine Learning

Huw Cheston, Reuben Bance, Peter M. C. Harrison

Comments: Paper: 40 pages, 11 figures, 1 table; added information on training time + computation cost, corrections to Table 1. Supplementary material: 33 pages, 48 figures, 6 tables; corrections to Table S.5

Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[75] arXiv:2504.05158 (cross-list from cs.SD) [pdf, html, other]: Title: Leveraging Label Potential for Enhanced Multimodal Emotion Recognition

Xuechun Shao, Yinfeng Yu, Liejun Wang

Comments: Main paper (8 pages). Accepted for publication by IJCNN 2025

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[76] arXiv:2504.05368 (cross-list from cs.SD) [pdf, html, other]: Title: Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift

Maja J. Hjuler, Line H. Clemmensen, Sneha Das

Comments: Published in the proceedings of ICASSP 2025

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[77] arXiv:2504.05686 (cross-list from cs.SD) [pdf, html, other]: Title: kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization

Keren Shao, Ke Chen, Matthew Baas, Shlomo Dubnov

Comments: 5 pages, 6 figures, 1 table, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[78] arXiv:2504.05690 (cross-list from cs.SD) [pdf, html, other]: Title: STAGE: Stemmed Accompaniment Generation through Prefix-Based Conditioning

Giorgio Strano, Chiara Ballanti, Donato Crisostomi, Michele Mancusi, Luca Cosmo, Emanuele Rodolà

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[79] arXiv:2504.05802 (cross-list from cs.SD) [pdf, html, other]: Title: Mass-Spring Models for Passive Keyword Spotting: A Springtronics Approach

Finn Bohte, Theophile Louvet, Vincent Maillou, Marc Serra Garcia

Comments: 14 pages, 8 figures

Subjects: Sound (cs.SD); Disordered Systems and Neural Networks (cond-mat.dis-nn); Audio and Speech Processing (eess.AS)
[80] arXiv:2504.05847 (cross-list from cs.SD) [pdf, html, other]: Title: Réduire le bruit grâce à la réalité augmentée sonore -- Auditory Concealer

Clara Boukhemia

Comments: 57 pages, in French language, 24 figures

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[81] arXiv:2504.06165 (cross-list from cs.SD) [pdf, other]: Title: Real-Time Pitch/F0 Detection Using Spectrogram Images and Convolutional Neural Networks

Xufang Zhao, Omer Tsimhoni

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[82] arXiv:2504.06275 (cross-list from cs.IR) [pdf, html, other]: Title: A Cascaded Architecture for Extractive Summarization of Multimedia Content via Audio-to-Text Alignment

Tanzir Hossain, Ar-Rafi Islam, Md. Sabbir Hossain, Annajiat Alim Rasel

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[83] arXiv:2504.06778 (cross-list from cs.SD) [pdf, html, other]: Title: CAFA: a Controllable Automatic Foley Artist

Roi Benita, Michael Finkelson, Tavi Halperin, Gleb Sterkin, Yossi Adi

Comments: Renamed paper to "CAFA: a Controllable Automatic Foley Artist" from "Controllable Automatic Foley Artist". Updated link to demo page

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[84] arXiv:2504.07053 (cross-list from cs.CL) [pdf, html, other]: Title: TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee

Comments: Preprint

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[85] arXiv:2504.07229 (cross-list from cs.CL) [pdf, html, other]: Title: Visual-Aware Speech Recognition for Noisy Scenarios

Lakshmipathi Balaji, Karan Singla

Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[86] arXiv:2504.07345 (cross-list from cs.SD) [pdf, html, other]: Title: Quantum-Inspired Genetic Algorithm for Robust Source Separation in Smart City Acoustics

Minh K. Quan, Mayuri Wijayasundara, Sujeeva Setunge, Pubudu N. Pathirana

Comments: 6 pages, 2 figures, IEEE International Conference on Communications (ICC 2025)

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[87] arXiv:2504.07406 (cross-list from cs.SD) [pdf, html, other]: Title: Towards Generalizability to Tone and Content Variations in the Transcription of Amplifier Rendered Electric Guitar Audio

Yu-Hua Chen, Yuan-Chiao Cheng, Yen-Tung Yeh, Jui-Te Wu, Jyh-Shing Roger Jang, Yi-Hsuan Yang

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[88] arXiv:2504.08024 (cross-list from cs.CL) [pdf, other]: Title: Summarizing Speech: A Comprehensive Survey

Fabian Retkowski, Maike Züfle, Andreas Sudmann, Dinah Pfau, Shinji Watanabe, Jan Niehues, Alexander Waibel

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[89] arXiv:2504.08274 (cross-list from cs.SD) [pdf, html, other]: Title: Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation

Haowei Lou, Hye-young Paik, Sheng Li, Wen Hu, Lina Yao

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[90] arXiv:2504.08365 (cross-list from cs.SD) [pdf, html, other]: Title: Location-Oriented Sound Event Localization and Detection with Spatial Mapping and Regression Localization

Xueping Zhang, Yaxiong Chen, Ruilin Yao, Yunfei Zi, Shengwu Xiong

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[91] arXiv:2504.08371 (cross-list from cs.SD) [pdf, html, other]: Title: Passive Underwater Acoustic Signal Separation based on Feature Decoupling Dual-path Network

Yucheng Liu, Longyu Jiang

Comments: 10 pages, 4 figures

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[92] arXiv:2504.08470 (cross-list from cs.SD) [pdf, other]: Title: On the Design of Diffusion-based Neural Speech Codecs

Pietro Foti, Andreas Brendel

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[93] arXiv:2504.08528 (cross-list from cs.CL) [pdf, html, other]: Title: On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[94] arXiv:2504.08659 (cross-list from cs.SD) [pdf, html, other]: Title: BowelRCNN: Region-based Convolutional Neural Network System for Bowel Sound Auscultation

Igor Matynia, Robert Nowak

Comments: 10 pages, 3 figures

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[95] arXiv:2504.08907 (cross-list from cs.SD) [pdf, html, other]: Title: Spatial Audio Processing with Large Language Model on Wearable Devices

Ayushi Mishra, Yang Bai, Priyadarshan Narayanasamy, Nakul Garg, Nirupam Roy

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[96] arXiv:2504.09219 (cross-list from cs.SD) [pdf, other]: Title: Generation of Musical Timbres using a Text-Guided Diffusion Model

Weixuan Yuan, Qadeer Khan, Vladimir Golkov

Comments: 10 pages, 5 figures

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[97] arXiv:2504.09225 (cross-list from cs.SD) [pdf, html, other]: Title: AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis

Yubing Cao, Yinfeng Yu, Yongming Li, Liejun Wang

Comments: Main paper (8 pages). Accepted for publication by IJCNN 2025

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[98] arXiv:2504.09516 (cross-list from cs.SD) [pdf, html, other]: Title: FSSUAVL: A Discriminative Framework using Vision Models for Federated Self-Supervised Audio and Image Understanding

Yasar Abbas Ur Rehman, Kin Wai Lau, Yuyang Xie, Ma Lan, JiaJun Shen

Comments: 8 pages

Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[99] arXiv:2504.09885 (cross-list from cs.SD) [pdf, html, other]: Title: Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis

Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, Xiu Li

Comments: 15 pages, 7 figures, Accepted to ACMMM 2025

Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[100] arXiv:2504.09980 (cross-list from cs.CL) [pdf, html, other]: Title: Turn-taking annotation for quantitative and qualitative analyses of conversation

Anneliese Kelterer, Barbara Schuppler

Comments: 41 pages

Subjects: Computation and Language (cs.CL); Databases (cs.DB); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
