Audio and Speech Processing

Authors and titles for April 2025

Total of 149 entries; showing entries 1-100.
[1] arXiv:2504.00621 [pdf, html, other]
Title: How Cyclic Acoustic Patterns Influence ASMR Perception: A Signal Processing Perspective
Zexin Fang, Bin Han, Henrik H. Sveen, C. Clark Cao, Hans D. Schotten
Comments: Submitted to IEEE Transactions on Cognitive and Developmental Systems
Subjects: Audio and Speech Processing (eess.AS)
[2] arXiv:2504.00742 [pdf, other]
Title: Expanding and Analyzing ODAQ -- the Open Dataset of Audio Quality
Sascha Dick, Christoph Thompson, Chih-Wei Wu, Matteo Torcoli, Pablo Delgado, Phillip A. Williams, Emanuel Habets
Comments: Accepted for presentation at the Audio Engineering Society (AES) 157th Convention, October 2024, New York, USA
Subjects: Audio and Speech Processing (eess.AS)
[3] arXiv:2504.01392 [pdf, html, other]
Title: Spatial-Filter-Bank-Based Neural Method for Multichannel Speech Enhancement
Tianqin Zheng, Jilu Jin, Hanchen Pei, Gongping Huang, Jingdong Chen, Jacob Benesty
Subjects: Audio and Speech Processing (eess.AS)
[4] arXiv:2504.01767 [pdf, html, other]
Title: Leveraging Embedding Techniques in Multimodal Machine Learning for Mental Illness Assessment
Abdelrahaman A. Hassan, Abdelrahman A. Ali, Aya E. Fouda, Radwa J. Hanafy, Mohammed E. Fouda
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[5] arXiv:2504.03329 [pdf, html, other]
Title: Mind the Prompt: Prompting Strategies in Audio Generations for Improving Sound Classification
Francesca Ronchini, Ho-Hsiang Wu, Wei-Cheng Lin, Fabio Antonacci
Comments: Accepted at Generative Data Augmentation for Real-World Signal Processing Applications Workshop
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
[6] arXiv:2504.04075 [pdf, html, other]
Title: Real-Time Auralization for First-Person Vocal Interaction in Immersive Virtual Environments
Mauricio Flores-Vargas, Enda Bates, Rachel McDonnell
Subjects: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
[7] arXiv:2504.04450 [pdf, html, other]
Title: WaveNet-Volterra Neural Networks for Active Noise Control: A Fully Causal Approach
Lu Bai, Mengtong Li, Siyuan Lian, Kai Chen, Jing Lu
Subjects: Audio and Speech Processing (eess.AS)
[8] arXiv:2504.04512 [pdf, html, other]
Title: Trainable Adaptive Score Normalization for Automatic Speaker Verification
Jeong-Hwan Choi, Ju-Seok Seong, Ye-Rin Jeoung, Joon-Hyuk Chang
Comments: Accepted at ICASSP'25
Subjects: Audio and Speech Processing (eess.AS)
[9] arXiv:2504.04721 [pdf, html, other]
Title: Bridging the Gap between Continuous and Informative Discrete Representations by Random Product Quantization
Xueqing Li, Zehan Li, Boyu Zhu, Ruihao Jing, Jian Kang, Jie Li, Xiao-Lei Zhang, Xuelong Li
Subjects: Audio and Speech Processing (eess.AS)
[10] arXiv:2504.04751 [pdf, html, other]
Title: Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches
Eloi Moliner, Michal Švento, Alec Wright, Lauri Juvela, Pavel Rajmic, Vesa Välimäki
Comments: Submitted to the 28th International Conference on Digital Audio Effects (DAFx25)
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[11] arXiv:2504.05657 [pdf, html, other]
Title: Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing
Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li
Comments: This manuscript has been submitted for peer review
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[12] arXiv:2504.06963 [pdf, html, other]
Title: RNN-Transducer-based Losses for Speech Recognition on Noisy Targets
Vladimir Bataev
Comments: Final Project Report, Bachelor's Degree in Computer Science, University of London, March 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[13] arXiv:2504.07652 [pdf, html, other]
Title: Categorical Unsupervised Variational Acoustic Clustering
Luan Vinícius Fiorio, Ivana Nikoloska, Ronald M. Aarts
Subjects: Audio and Speech Processing (eess.AS)
[14] arXiv:2504.08524 [pdf, html, other]
Title: USM-VC: Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion
Na Li, Chuke Wang, Yu Gu, Zhifeng Li
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[15] arXiv:2504.08624 [pdf, html, other]
Title: TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration
Matteo Spanio, Antonio Rodà
Comments: Submitted to DAFx 2025
Subjects: Audio and Speech Processing (eess.AS); Performance (cs.PF); Sound (cs.SD); Signal Processing (eess.SP)
[16] arXiv:2504.08644 [pdf, html, other]
Title: Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation
Davide Berghi, Philip J. B. Jackson
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[17] arXiv:2504.08997 [pdf, other]
Title: Beyond Global Metrics: A Fairness Analysis for Interpretable Voice Disorder Detection Systems
Mariel Estevez, Cyntia Bonomi, Dayana Ribas, Alfonso Ortega, Luciana Ferrer
Comments: 34 pages, 6 figures, 2 tables
Subjects: Audio and Speech Processing (eess.AS)
[18] arXiv:2504.09081 [pdf, other]
Title: SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, Arunasish Sen, Jian Xie, Grant P. Strimel, Andreas Schwarz
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[19] arXiv:2504.09381 [pdf, html, other]
Title: DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers
Heitor R. Guimarães, Jiaqi Su, Rithesh Kumar, Tiago H. Falk, Zeyu Jin
Comments: Manuscript under review
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[20] arXiv:2504.10352 [pdf, html, other]
Title: Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen
Comments: Accepted in ACMMM 2025
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
[21] arXiv:2504.11246 [pdf, html, other]
Title: Respiratory Inhaler Sound Event Classification Using Self-Supervised Learning
Davoud Shariat Panah, Alessandro N Franciosi, Cormac McCarthy, Andrew Hines
Comments: Accepted at the IEEE EMBC 2025 Conference
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[22] arXiv:2504.12423 [pdf, html, other]
Title: Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios
Haohan Shi, Xiyu Shi, Safak Dogan, Saif Alzubi, Tianjin Huang, Yunxiao Zhang
Comments: Accepted by EUSIPCO 2025
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[23] arXiv:2504.12670 [pdf, html, other]
Title: Temporal Attention Pooling for Frequency Dynamic Convolution in Sound Event Detection
Hyeonuk Nam, Yong-Hwa Park
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[24] arXiv:2504.12867 [pdf, html, other]
Title: EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen
Comments: Accepted at ACMMM 2025
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[25] arXiv:2504.12870 [pdf, html, other]
Title: CST-former: Multidimensional Attention-based Transformer for Sound Event Localization and Detection in Real Scenes
Yusun Shul, Dayun Choi, Jung-Woo Choi
Comments: 12 pages, 10 figures, Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects: Audio and Speech Processing (eess.AS)
[26] arXiv:2504.13765 [pdf, other]
Title: Modeling L1 Influence on L2 Pronunciation: An MFCC-Based Framework for Explainable Machine Learning and Pedagogical Feedback
Peyman Jahanbin
Comments: 27 pages (including references), 4 figures, 1 table. Combines statistical inference and explainable machine learning to model L1 influence in L2 pronunciation using MFCC features. Methodology and code are openly available via Zenodo and OSF.
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[27] arXiv:2504.14183 [pdf, html, other]
Title: The First VoicePrivacy Attacker Challenge
Natalia Tomashenko, Xiaoxiao Miao, Emmanuel Vincent, Junichi Yamagishi
Comments: Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Journal-ref: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-2
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
[28] arXiv:2504.14409 [pdf, html, other]
Title: Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training
Christopher Ick, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Jonathan Le Roux
Comments: Presented at ICASSP 2025 GenDA Workshop
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[29] arXiv:2504.14437 [pdf, html, other]
Title: Predicting speech intelligibility in older adults for speech enhancement using the Gammachirp Envelope Similarity Index, GESI
Ayako Yamamoto, Fuki Miyazaki, Toshio Irino
Comments: This is a revised manuscript that was submitted to Speech Communication on August 15, 2025
Subjects: Audio and Speech Processing (eess.AS)
[30] arXiv:2504.14817 [pdf, other]
Title: DNN based HRIRs Identification with a Continuously Rotating Speaker Array
Byeong-Yun Ko, Deokki Min, Hyeonuk Nam, Yong-Hwa Park
Subjects: Audio and Speech Processing (eess.AS)
[31] arXiv:2504.14843 [pdf, html, other]
Title: Quantitative Measures for Passive Sonar Texture Analysis
Jarin Ritu, Alexandra Van Dine, Joshua Peeples
Comments: 5 pages, 2 figures
Subjects: Audio and Speech Processing (eess.AS)
[32] arXiv:2504.14906 [pdf, html, other]
Title: OmniAudio: Generating Spatial Audio from 360-Degree Video
Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue
Comments: ICML 2025
Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
[33] arXiv:2504.14915 [pdf, html, other]
Title: StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models
Yeona Hong, Hyewon Han, Woo-jin Chung, Hong-Goo Kang
Comments: Accepted at ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[34] arXiv:2504.14981 [pdf, html, other]
Title: On feature representations for marmoset vocal communication analysis
Eklavya Sarkar, Kaja Wierucka, Alexandra B. Bosshard, Judith Burkart, Mathew Magimai.-Doss
Journal-ref: Bioacoustics Journal (2025) 1-15
Subjects: Audio and Speech Processing (eess.AS)
[35] arXiv:2504.15575 [pdf, html, other]
Title: Exploring the User Experience of AI-Assisted Sound Searching Systems for Creative Workflows
Haohe Liu, Thomas Deacon, Wenwu Wang, Matt Paradis, Mark D. Plumbley
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[36] arXiv:2504.15663 [pdf, html, other]
Title: FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning
Ju Yeon Kang, Ji Won Yoon, Semin Kim, Min Hyun Han, Nam Soo Kim
Comments: Accepted at ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[37] arXiv:2504.16223 [pdf, html, other]
Title: Perceptual Audio Coding: A 40-Year Historical Perspective
Jürgen Herre, Schuyler Quackenbush, Minje Kim, Jan Skoglund
Journal-ref: Published in the Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
Subjects: Audio and Speech Processing (eess.AS)
[38] arXiv:2504.16289 [pdf, html, other]
Title: Deep, data-driven modeling of room acoustics: literature review and research perspectives
Toon van Waterschoot
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[39] arXiv:2504.16441 [pdf, html, other]
Title: SoCov: Semi-Orthogonal Parametric Pooling of Covariance Matrix for Speaker Recognition
Rongjin Li, Weibin Zhang, Dongpeng Chen, Jintao Kang, Xiaofen Xing
Comments: This paper has been accepted by IEEE ICASSP2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[40] arXiv:2504.17440 [pdf, html, other]
Title: Generating Localized Audible Zones Using a Single-Channel Parametric Loudspeaker
Tao Zhuang, Shaozhe Li, Feng Niu, Jia-Xin Zhong, Jing Lu
Subjects: Audio and Speech Processing (eess.AS)
[41] arXiv:2504.18004 [pdf, html, other]
Title: Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis
Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada
Comments: 4 pages, 1 figure, and 4 tables. Accepted by IEEE EMBC 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[42] arXiv:2504.18157 [pdf, html, other]
Title: DOSE : Drum One-Shot Extraction from Music Mixture
Suntae Hwang, Seonghyeon Kang, Kyungsu Kim, Semin Ahn, Kyogu Lee
Comments: Published in IEEE ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[43] arXiv:2504.18425 [pdf, html, other]
Title: Kimi-Audio Technical Report
KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yuefeng Wu, Yuxin Wu, Dongchao Yang, Hao Yang, Ying Yang, Zhilin Yang, Aoxiong Yin, Ruibin Yuan, Yutong Zhang, Zaida Zhou
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
[44] arXiv:2504.18502 [pdf, other]
Title: Music Tempo Estimation on Solo Instrumental Performance
Zhanhong He, Roberto Togneri, Xiangyu Zhang
Comments: 4 pages, rejected paper by WASPAA2023
Subjects: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR)
[45] arXiv:2504.18539 [pdf, html, other]
Title: Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun
Comments: ICLR 2025; 22 pages, 6 figures, 14 tables
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
[46] arXiv:2504.19046 [pdf, html, other]
Title: Enhancing Cochlear Implant Signal Coding with Scaled Dot-Product Attention
Billel Essaid, Hamza Kheddar, Noureddine Batel
Journal-ref: 2024 International Conference on Telecommunications and Intelligent Systems (ICTIS)
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[47] arXiv:2504.19062 [pdf, html, other]
Title: Versatile Framework for Song Generation with Prompt-based Control
Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao
Comments: Accepted by Findings of EMNLP 2025
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[48] arXiv:2504.19605 [pdf, html, other]
Title: A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models
Kohei Saijo, Tetsuji Ogawa
Comments: 5 pages, 3 tables, 2 figures. Accepted to EUSIPCO2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[49] arXiv:2504.20334 [pdf, html, other]
Title: Towards Flow-Matching-based TTS without Classifier-Free Guidance
Yuzhe Liang, Wenzhe Liu, Chunyu Qiang, Zhikang Niu, Yushen Chen, Ziyang Ma, Wenxi Chen, Nan Li, Chen Zhang, Xie Chen
Subjects: Audio and Speech Processing (eess.AS)
[50] arXiv:2504.20630 [pdf, html, other]
Title: ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao
Comments: Accepted by ACM Multimedia 2025
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM); Sound (cs.SD)
[51] arXiv:2504.21528 [pdf, html, other]
Title: Impairments are Clustered in Latents of Deep Neural Network-based Speech Quality Models
Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee
Subjects: Audio and Speech Processing (eess.AS)
[52] arXiv:2504.21815 [pdf, html, other]
Title: From Aesthetics to Human Preferences: Comparative Perspectives of Evaluating Text-to-Music Systems
Huan Zhang, Jinhua Liang, Huy Phan, Wenwu Wang, Emmanouil Benetos
Subjects: Audio and Speech Processing (eess.AS)
[53] arXiv:2504.00750 (cross-list from cs.SD) [pdf, html, other]
Title: $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li
Comments: Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[54] arXiv:2504.01094 (cross-list from cs.SD) [pdf, html, other]
Title: Multilingual and Multi-Accent Jailbreaking of Audio LLMs
Jaechul Roh, Virat Shejwalkar, Amir Houmansadr
Comments: 21 pages, 6 figures, 15 tables
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
[55] arXiv:2504.01297 (cross-list from cs.RO) [pdf, html, other]
Title: AIM: Acoustic Inertial Measurement for Indoor Drone Localization and Tracking
Yimiao Sun, Weiguo Wang, Luca Mottola, Ruijin Wang, Yuan He
Comments: arXiv admin note: substantial text overlap with arXiv:2504.00445
Subjects: Robotics (cs.RO); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[56] arXiv:2504.01519 (cross-list from cs.CL) [pdf, html, other]
Title: Chain of Correction for Full-text Speech Recognition with Large Language Models
Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[57] arXiv:2504.01690 (cross-list from cs.SD) [pdf, html, other]
Title: Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance
Taehan Lee, Hyukjun Lee
Comments: Accepted at the 28th European Conference on Artificial Intelligence (ECAI 2025). Source code is available.
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[58] arXiv:2504.02061 (cross-list from cs.CV) [pdf, html, other]
Title: Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo, Shuailei Ma, Shijie Ma, Xiaoyi Bao, Chen-Wei Xie, Kecheng Zheng, Tingyu Weng, Siyang Sun, Yun Zheng, Wei Zou
Comments: Accepted to ICLR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[59] arXiv:2504.02302 (cross-list from cs.SD) [pdf, html, other]
Title: Causal Self-supervised Pretrained Frontend with Predictive Code for Speech Separation
Wupeng Wang, Zexu Pan, Xinke Li, Shuai Wang, Haizhou Li
Comments: arXiv admin note: text overlap with arXiv:2411.03085
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[60] arXiv:2504.02386 (cross-list from cs.CV) [pdf, html, other]
Title: VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, David Harwath
Subjects: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[61] arXiv:2504.02398 (cross-list from cs.CL) [pdf, html, other]
Title: Scaling Analysis of Interleaved Speech-Text Language Models
Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi
Comments: Accepted at COLM 2025
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[62] arXiv:2504.02402 (cross-list from cs.SD) [pdf, html, other]
Title: EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling
Hao Yin, Shi Guo, Xu Jia, Xudong XU, Lu Zhang, Si Liu, Dong Wang, Huchuan Lu, Tianfan Xue
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[63] arXiv:2504.02407 (cross-list from cs.SD) [pdf, html, other]
Title: F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, Baoxun Wang
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[64] arXiv:2504.02586 (cross-list from cs.SD) [pdf, other]
Title: Deep learning for music generation. Four approaches and their comparative evaluation
Razvan Paroiu, Stefan Trausan-Matu
Journal-ref: U.P.B. Scientific Bulletin, Series C, Vol. 85, Issue 4, 2023
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[65] arXiv:2504.02604 (cross-list from cs.CL) [pdf, html, other]
Title: LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect
Hedi Naouara, Jean-Pierre Lorré, Jérôme Louradour
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[66] arXiv:2504.02988 (cross-list from cs.SD) [pdf, html, other]
Title: Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection
Adrian S. Roman, Aiden Chang, Gerardo Meza, Iran R. Roman
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[67] arXiv:2504.03289 (cross-list from cs.SD) [pdf, html, other]
Title: RWKVTTS: Yet another TTS based on RWKV-7
Lin yueyu, Liu Xiao
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[68] arXiv:2504.03373 (cross-list from cs.SD) [pdf, html, other]
Title: An Efficient GPU-based Implementation for Noise Robust Sound Source Localization
Zirui Lin, Masayuki Takigahira, Naoya Terakado, Haris Gulzar, Monikka Roslianna Busto, Takeharu Eda, Katsutoshi Itoyama, Kazuhiro Nakadai, Hideharu Amano
Comments: 6 pages, 2 figures
Subjects: Sound (cs.SD); Robotics (cs.RO); Audio and Speech Processing (eess.AS)
[69] arXiv:2504.03546 (cross-list from cs.CL) [pdf, html, other]
Title: MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang
Comments: Preprint, 122 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[70] arXiv:2504.03679 (cross-list from eess.SP) [pdf, other]
Title: Continuous Boostlet Transform and Associated Uncertainty Principles
Owais Ahmad, Jasifa Fayaz
Comments: 28 pages, 6 figures
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS); Functional Analysis (math.FA)
[71] arXiv:2504.03998 (cross-list from cs.SD) [pdf, html, other]
Title: Determined blind source separation via modeling adjacent frequency band correlations in speech signals
Jianyu Wang, Shanzheng Guan, Zhengqiao Zhao, Nicolas Dobigeon, Jingdong Chen
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[72] arXiv:2504.04060 (cross-list from cs.CL) [pdf, html, other]
Title: VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[73] arXiv:2504.04589 (cross-list from cs.SD) [pdf, html, other]
Title: Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling
Yicheng Gu, Runsong Zhang, Lauri Juvela, Zhizheng Wu
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[74] arXiv:2504.05009 (cross-list from cs.SD) [pdf, html, other]
Title: Deconstructing Jazz Piano Style Using Machine Learning
Huw Cheston, Reuben Bance, Peter M. C. Harrison
Comments: Paper: 40 pages, 11 figures, 1 table; added information on training time + computation cost, corrections to Table 1. Supplementary material: 33 pages, 48 figures, 6 tables; corrections to Table S.5
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[75] arXiv:2504.05158 (cross-list from cs.SD) [pdf, html, other]
Title: Leveraging Label Potential for Enhanced Multimodal Emotion Recognition
Xuechun Shao, Yinfeng Yu, Liejun Wang
Comments: Main paper (8 pages). Accepted for publication by IJCNN 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[76] arXiv:2504.05368 (cross-list from cs.SD) [pdf, html, other]
Title: Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift
Maja J. Hjuler, Line H. Clemmensen, Sneha Das
Comments: Published in the proceedings of ICASSP 2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[77] arXiv:2504.05686 (cross-list from cs.SD) [pdf, html, other]
Title: kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization
Keren Shao, Ke Chen, Matthew Baas, Shlomo Dubnov
Comments: 5 pages, 6 figures, 1 table, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[78] arXiv:2504.05690 (cross-list from cs.SD) [pdf, html, other]
Title: STAGE: Stemmed Accompaniment Generation through Prefix-Based Conditioning
Giorgio Strano, Chiara Ballanti, Donato Crisostomi, Michele Mancusi, Luca Cosmo, Emanuele Rodolà
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[79] arXiv:2504.05802 (cross-list from cs.SD) [pdf, html, other]
Title: Mass-Spring Models for Passive Keyword Spotting: A Springtronics Approach
Finn Bohte, Theophile Louvet, Vincent Maillou, Marc Serra Garcia
Comments: 14 pages, 8 figures
Subjects: Sound (cs.SD); Disordered Systems and Neural Networks (cond-mat.dis-nn); Audio and Speech Processing (eess.AS)
[80] arXiv:2504.05847 (cross-list from cs.SD) [pdf, html, other]
Title: Réduire le bruit grâce à la réalité augmentée sonore -- Auditory Concealer
Clara Boukhemia
Comments: 57 pages, in French language, 24 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[81] arXiv:2504.06165 (cross-list from cs.SD) [pdf, other]
Title: Real-Time Pitch/F0 Detection Using Spectrogram Images and Convolutional Neural Networks
Xufang Zhao, Omer Tsimhoni
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[82] arXiv:2504.06275 (cross-list from cs.IR) [pdf, html, other]
Title: A Cascaded Architecture for Extractive Summarization of Multimedia Content via Audio-to-Text Alignment
Tanzir Hossain, Ar-Rafi Islam, Md. Sabbir Hossain, Annajiat Alim Rasel
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[83] arXiv:2504.06778 (cross-list from cs.SD) [pdf, html, other]
Title: CAFA: a Controllable Automatic Foley Artist
Roi Benita, Michael Finkelson, Tavi Halperin, Gleb Sterkin, Yossi Adi
Comments: Renamed paper to "CAFA: a Controllable Automatic Foley Artist" from "Controllable Automatic Foley Artist". Updated link to demo page
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[84] arXiv:2504.07053 (cross-list from cs.CL) [pdf, html, other]
Title: TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee
Comments: Preprint
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[85] arXiv:2504.07229 (cross-list from cs.CL) [pdf, html, other]
Title: Visual-Aware Speech Recognition for Noisy Scenarios
Lakshmipathi Balaji, Karan Singla
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[86] arXiv:2504.07345 (cross-list from cs.SD) [pdf, html, other]
Title: Quantum-Inspired Genetic Algorithm for Robust Source Separation in Smart City Acoustics
Minh K. Quan, Mayuri Wijayasundara, Sujeeva Setunge, Pubudu N. Pathirana
Comments: 6 pages, 2 figures, IEEE International Conference on Communications (ICC 2025)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[87] arXiv:2504.07406 (cross-list from cs.SD) [pdf, html, other]
Title: Towards Generalizability to Tone and Content Variations in the Transcription of Amplifier Rendered Electric Guitar Audio
Yu-Hua Chen, Yuan-Chiao Cheng, Yen-Tung Yeh, Jui-Te Wu, Jyh-Shing Roger Jang, Yi-Hsuan Yang
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[88] arXiv:2504.08024 (cross-list from cs.CL) [pdf, other]
Title: Summarizing Speech: A Comprehensive Survey
Fabian Retkowski, Maike Züfle, Andreas Sudmann, Dinah Pfau, Shinji Watanabe, Jan Niehues, Alexander Waibel
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[89] arXiv:2504.08274 (cross-list from cs.SD) [pdf, html, other]
Title: Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation
Haowei Lou, Hye-young Paik, Sheng Li, Wen Hu, Lina Yao
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[90] arXiv:2504.08365 (cross-list from cs.SD) [pdf, html, other]
Title: Location-Oriented Sound Event Localization and Detection with Spatial Mapping and Regression Localization
Xueping Zhang, Yaxiong Chen, Ruilin Yao, Yunfei Zi, Shengwu Xiong
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[91] arXiv:2504.08371 (cross-list from cs.SD) [pdf, html, other]
Title: Passive Underwater Acoustic Signal Separation based on Feature Decoupling Dual-path Network
Yucheng Liu, Longyu Jiang
Comments: 10 pages, 4 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[92] arXiv:2504.08470 (cross-list from cs.SD) [pdf, other]
Title: On the Design of Diffusion-based Neural Speech Codecs
Pietro Foti, Andreas Brendel
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[93] arXiv:2504.08528 (cross-list from cs.CL) [pdf, html, other]
Title: On The Landscape of Spoken Language Models: A Comprehensive Survey
Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[94] arXiv:2504.08659 (cross-list from cs.SD) [pdf, html, other]
Title: BowelRCNN: Region-based Convolutional Neural Network System for Bowel Sound Auscultation
Igor Matynia, Robert Nowak
Comments: 10 pages, 3 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[95] arXiv:2504.08907 (cross-list from cs.SD) [pdf, html, other]
Title: Spatial Audio Processing with Large Language Model on Wearable Devices
Ayushi Mishra, Yang Bai, Priyadarshan Narayanasamy, Nakul Garg, Nirupam Roy
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[96] arXiv:2504.09219 (cross-list from cs.SD) [pdf, other]
Title: Generation of Musical Timbres using a Text-Guided Diffusion Model
Weixuan Yuan, Qadeer Khan, Vladimir Golkov
Comments: 10 pages, 5 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[97] arXiv:2504.09225 (cross-list from cs.SD) [pdf, html, other]
Title: AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis
Yubing Cao, Yinfeng Yu, Yongming Li, Liejun Wang
Comments: Main paper (8 pages). Accepted for publication by IJCNN 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[98] arXiv:2504.09516 (cross-list from cs.SD) [pdf, html, other]
Title: FSSUAVL: A Discriminative Framework using Vision Models for Federated Self-Supervised Audio and Image Understanding
Yasar Abbas Ur Rehman, Kin Wai Lau, Yuyang Xie, Ma Lan, JiaJun Shen
Comments: 8 pages
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[99] arXiv:2504.09885 (cross-list from cs.SD) [pdf, html, other]
Title: Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis
Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, Xiu Li
Comments: 15 pages, 7 figures, Accepted to ACMMM 2025
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[100] arXiv:2504.09980 (cross-list from cs.CL) [pdf, html, other]
Title: Turn-taking annotation for quantitative and qualitative analyses of conversation
Anneliese Kelterer, Barbara Schuppler
Comments: 41 pages
Subjects: Computation and Language (cs.CL); Databases (cs.DB); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)