Audio and Speech Processing

Authors and titles for October 2024

Total of 358 entries; showing entries 1-100
[1] arXiv:2410.00035 [pdf, html, other]
Title: FeruzaSpeech: A 60 Hour Uzbek Read Speech Corpus with Punctuation, Casing, and Context
Anna Povey, Katherine Povey
Comments: 5 Pages, 1 Figure, Preprint of Paper Accepted in ICNLSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
[2] arXiv:2410.00037 [pdf, html, other]
Title: Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[3] arXiv:2410.00070 [pdf, html, other]
Title: Mamba for Streaming ASR Combined with Unimodal Aggregation
Ying Fang, Xiaofei Li
Comments: Accepted by ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[4] arXiv:2410.00390 [pdf, html, other]
Title: Multi-Scale Temporal Transformer For Speech Emotion Recognition
Zhipeng Li, Xiaofen Xing, Yuanbo Fang, Weibin Zhang, Hengsheng Fan, Xiangmin Xu
Subjects: Audio and Speech Processing (eess.AS)
[5] arXiv:2410.00511 [pdf, html, other]
Title: Pre-training with Synthetic Patterns for Audio
Yuchi Ishikawa, Tatsuya Komatsu, Yoshimitsu Aoki
Comments: Submitted to ICASSP'25
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[6] arXiv:2410.00527 [pdf, html, other]
Title: Wanna hear your voice? A sample is all we need!
The Hieu Pham, Phuong Thanh Tran Nguyen, Xuan Tho Nguyen, Tan Dat Nguyen, Duc Dung Nguyen
Comments: work in progress
Subjects: Audio and Speech Processing (eess.AS)
[7] arXiv:2410.00528 [pdf, other]
Title: End-to-End Speech Recognition with Pre-trained Masked Language Model
Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe
Subjects: Audio and Speech Processing (eess.AS)
[8] arXiv:2410.00680 [pdf, html, other]
Title: The Conformer Encoder May Reverse the Time Dimension
Robin Schmitt, Albert Zeyer, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
Comments: Accepted at ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Machine Learning (stat.ML)
[9] arXiv:2410.01108 [pdf, html, other]
Title: Augmentation through Laundering Attacks for Audio Spoof Detection
Hashim Ali, Surya Subramani, Hafiz Malik
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[10] arXiv:2410.01150 [pdf, html, other]
Title: Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules
Hsin-Tien Chiang, Hao Zhang, Yong Xu, Meng Yu, Dong Yu
Comments: Paper in submission
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[11] arXiv:2410.01162 [pdf, html, other]
Title: Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli
Comments: Accepted to Interspeech 2025
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[12] arXiv:2410.01562 [pdf, html, other]
Title: HRTF Estimation using a Score-based Prior
Etienne Thuillier, Jean-Marie Lemercier, Eloi Moliner, Timo Gerkmann, Vesa Välimäki
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[13] arXiv:2410.01841 [pdf, other]
Title: A GEN AI Framework for Medical Note Generation
Hui Yi Leong, Yi Fan Gao, Shuai Ji, Bora Kalaycioglu, Uktu Pamuksuz
Comments: 8 figures, 7 pages, IEEE standard research paper
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Sound (cs.SD)
[14] arXiv:2410.02056 [pdf, html, other]
Title: Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
Comments: Accepted at ICLR 2025. Code and Checkpoints available here: this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[15] arXiv:2410.02364 [pdf, html, other]
Title: State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data
Sara Barahona, Ladislav Mošner, Themos Stafylakis, Oldřich Plchot, Junyi Peng, Lukáš Burget, Jan Černocký
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Audio and Speech Processing (eess.AS)
[16] arXiv:2410.02371 [pdf, html, other]
Title: NTU-NPU System for Voice Privacy 2024 Challenge
Nikita Kuzmin, Hieu-Thi Luong, Jixun Yao, Lei Xie, Kong Aik Lee, Eng Siong Chng
Comments: System description for VPC 2024
Journal-ref: 2024 Challenge. Proc. 4th Symposium on Security and Privacy in Speech Communication, 72-79
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[17] arXiv:2410.03007 [pdf, html, other]
Title: FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model
Yichen Lu, Jiaqi Song, Chao-Han Huck Yang, Shinji Watanabe
Comments: EMNLP 2024 Industry Track
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[18] arXiv:2410.03139 [pdf, html, other]
Title: How does the teacher rate? Observations from the NeuroPiano dataset
Huan Zhang, Vincent Cheung, Hayato Nishioka, Simon Dixon, Shinichi Furuya
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[19] arXiv:2410.03192 [pdf, html, other]
Title: MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech
Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo
Comments: Accepted to EMNLP 2024 Findings
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[20] arXiv:2410.03280 [pdf, other]
Title: Manikin-Recorded Cardiopulmonary Sounds Dataset Using Digital Stethoscope
Yasaman Torabi, Shahram Shirani, James P. Reilly
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
[21] arXiv:2410.03298 [pdf, html, other]
Title: Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
Jinzheng Zhao, Niko Moritz, Egor Lakomkin, Ruiming Xie, Zhiping Xiu, Katerina Zmolikova, Zeeshan Ahmed, Yashesh Gaur, Duc Le, Christian Fuegen
Comments: Submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS)
[22] arXiv:2410.04017 [pdf, html, other]
Title: Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System
Ze Li, Yao Shi, Yunfei Xu, Ming Li
Subjects: Audio and Speech Processing (eess.AS)
[23] arXiv:2410.04092 [pdf, html, other]
Title: Enhancement of Dysarthric Speech Reconstruction by Contrastive Learning
Keshvari Fatemeh, Mahdian Toroghi Rahil, Zareian Hassan
Subjects: Audio and Speech Processing (eess.AS)
[24] arXiv:2410.04198 [pdf, html, other]
Title: DJ Mix Transcription with Multi-Pass Non-Negative Matrix Factorization
Étienne Paul André, Dominique Fourer, Diemo Schwarz
Comments: Submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[25] arXiv:2410.04380 [pdf, html, other]
Title: HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
Yuto Nishimura, Takumi Hirose, Masanari Ohi, Hideki Nakayama, Nakamasa Inoue
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[26] arXiv:2410.04690 [pdf, html, other]
Title: SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech
Minchan Kim, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
[27] arXiv:2410.04785 [pdf, html, other]
Title: Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet
Xiang Hao, Chenxiang Ma, Qu Yang, Jibin Wu, Kay Chen Tan
Comments: under review
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[28] arXiv:2410.04951 [pdf, html, other]
Title: A decade of DCASE: Achievements, practices, evaluations and future challenges
Annamaria Mesaros, Romain Serizel, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley
Comments: Submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[29] arXiv:2410.05101 [pdf, html, other]
Title: CR-CTC: Consistency regularization on CTC for improved speech recognition
Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey
Comments: Published as a conference paper at ICLR 2025
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[30] arXiv:2410.05151 [pdf, html, other]
Title: Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer
Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang
Comments: Accepted for publication at ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[31] arXiv:2410.05302 [pdf, other]
Title: Episodic fine-tuning prototypical networks for optimization-based few-shot learning: Application to audio classification
Xuanyu Zhuang (LTCI, IP Paris, S2A, IDS), Geoffroy Peeters (LTCI, IP Paris, S2A, IDS), Gaël Richard (S2A, IDS, LTCI, IP Paris)
Comments: Accepted at MLSP 2024
Journal-ref: 2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2024), Sep 2024, London (UK), United Kingdom
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Signal Processing (eess.SP)
[32] arXiv:2410.05320 [pdf, html, other]
Title: The OCON model: an old but gold solution for distributable supervised classification
Stefano Giacomelli, Marco Giordano, Claudia Rinaldi
Comments: Accepted at "2024 29th IEEE Symposium on Computers and Communications (ISCC): workshop on Next-Generation Multimedia Services at the Edge: Leveraging 5G and Beyond (NGMSE2024)". arXiv admin note: text overlap with arXiv:2410.04098
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG); Sound (cs.SD)
[33] arXiv:2410.05620 [pdf, html, other]
Title: Improving Data Augmentation-based Cross-Speaker Style Transfer for TTS with Singing Voice, Style Filtering, and F0 Matching
Leonardo B. de M. M. Marques, Lucas H. Ueda, Mário U. Neto, Flávio O. Simões, Fernando Runstein, Bianca Dal Bó, Paula D. P. Costa
Comments: Submitted to INTERSPEECH 2024
Subjects: Audio and Speech Processing (eess.AS)
[34] arXiv:2410.05724 [pdf, html, other]
Title: Exploring rhythm formant analysis for Indic language classification
Parismita Gogoi, Sishir Kalita, Priyankoo Sarmah, S.R Mahadeva Prasanna
Comments: Submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[35] arXiv:2410.05986 [pdf, html, other]
Title: The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge
Ya Jiang, Hongbo Lan, Jun Du, Qing Wang, Shutong Niu
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[36] arXiv:2410.05997 [pdf, html, other]
Title: An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
Hugo Malard, Michel Olvera, Stéphane Lathuiliere, Slim Essid
Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[37] arXiv:2410.06670 [pdf, html, other]
Title: LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction
Di Liang, Xiaofei Li
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[38] arXiv:2410.06787 [pdf, html, other]
Title: Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch
Teodora Răgman, Adriana Stan
Comments: Accepted at 2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP 2024)
Subjects: Audio and Speech Processing (eess.AS)
[39] arXiv:2410.06885 [pdf, html, other]
Title: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen
Comments: 17 pages, 9 tables, 3 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[40] arXiv:2410.07277 [pdf, html, other]
Title: Swin-BERT: A Feature Fusion System designed for Speech-based Alzheimer's Dementia Detection
Yilin Pan, Yanpei Shi, Yijia Zhang, Mingyu Lu
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[41] arXiv:2410.07379 [pdf, html, other]
Title: Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge
Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj
Comments: Accepted into ASVspoof5 workshop
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[42] arXiv:2410.07428 [pdf, html, other]
Title: The First VoicePrivacy Attacker Challenge Evaluation Plan
Natalia Tomashenko, Xiaoxiao Miao, Emmanuel Vincent, Junichi Yamagishi
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
[43] arXiv:2410.07935 [pdf, html, other]
Title: Robust Fixed-Filter Sound Zone Control with Audio-Based Position Tracking
Sankha Subhra Bhattacharjee, Andreas Jonas Fuglsig, Flemming Christensen, Jesper Rindom Jensen, Mads Græsbøll Christensen
Comments: Equal contribution by Sankha Subhra Bhattacharjee and Andreas Jonas Fuglsig. Accepted at ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[44] arXiv:2410.07978 [pdf, html, other]
Title: Sound Zone Control Robust To Sound Speed Change
Sankha Subhra Bhattacharjee, Jesper Rindom Jensen, Mads Græsbøll Christensen
Comments: 5 pages, 4 figures, submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[45] arXiv:2410.07982 [pdf, html, other]
Title: Window Function-less DFT with Reduced Noise and Latency for Real-Time Music Analysis
Cai Biesinger, Hiromitsu Awano, Masanori Hashimoto
Comments: 5 pages, 4 figures, Submitted to EUSIPCO 2025. TeX-generated PDF exemption due to formatting problems on arXiv. This version: clarified text throughout, updated data after further optimization work, added more comparisons and a table, added references
Subjects: Audio and Speech Processing (eess.AS)
[46] arXiv:2410.08250 [pdf, html, other]
Title: Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis
Tuan Nguyen, Corinne Fredouille, Alain Ghio, Mathieu Balaguer, Virginie Woisard
Comments: Accepted at the Spoken Language Technology (SLT) Conference 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[47] arXiv:2410.08325 [pdf, html, other]
Title: Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer
Slava Shechtman, Avihu Dekel
Comments: You can download the model from this https URL
Journal-ref: Proc. Interspeech 2024, 4174-4178
Subjects: Audio and Speech Processing (eess.AS)
[48] arXiv:2410.08919 [pdf, html, other]
Title: Low-complexity Attention-based Unsupervised Anomalous Sound Detection exploiting Separable Convolutions and Angular Loss
Michael Neri, Marco Carli
Comments: Accepted for publication in IEEE Sensors Letters. 4 pages, 4 figures
Subjects: Audio and Speech Processing (eess.AS)
[49] arXiv:2410.09236 [pdf, other]
Title: Enhancing Infant Crying Detection with Gradient Boosting for Improved Emotional and Mental Health Diagnostics
Kyunghun Lee, Lauren M. Henry, Eleanor Hansen, Elizabeth Tandilashvili, Lauren S. Wakschlag, Elizabeth Norton, Daniel S. Pine, Melissa A. Brotman, Francisco Pereira
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[50] arXiv:2410.09503 [pdf, html, other]
Title: SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs
Wenxi Chen, Ziyang Ma, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[51] arXiv:2410.09636 [pdf, html, other]
Title: Can We Estimate Purchase Intention Based on Zero-shot Speech Emotion Recognition?
Ryotaro Nagase, Takashi Sumiyoshi, Natsuo Yamashita, Kota Dohi, Yohei Kawaguchi
Comments: 5 pages, 3 figures, accepted for APSIPA 2024 ASC
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[52] arXiv:2410.10434 [pdf, other]
Title: In-Materia Speech Recognition
Mohamadreza Zolfagharinejad, Julian Büchel, Lorenzo Cassola, Sachin Kinge, Ghazi Sarwat Syed, Abu Sebastian, Wilfred G. van der Wiel
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[53] arXiv:2410.11025 [pdf, html, other]
Title: Code Drift: Towards Idempotent Neural Audio Codecs
Patrick O'Reilly, Prem Seetharaman, Jiaqi Su, Zeyu Jin, Bryan Pardo
Comments: ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[54] arXiv:2410.11097 [pdf, html, other]
Title: DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Yingahao Aaron Li, Rithesh Kumar, Zeyu Jin
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[55] arXiv:2410.11181 [pdf, html, other]
Title: DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection
Sheng Yan, Cunhang Fan, Hongyu Zhang, Xiaoke Yang, Jianhua Tao, Zhao Lv
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
[56] arXiv:2410.11190 [pdf, html, other]
Title: Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Zhifei Xie, Changqiao Wu
Comments: Technical report, work in progress. Demo and code: this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[57] arXiv:2410.11453 [pdf, other]
Title: The importance of spatial and spectral information in multiple speaker tracking
Hanan Beit-On, Vladimir Tourbabin, Boaz Rafaely
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[58] arXiv:2410.11865 [pdf, html, other]
Title: Automatic Screening for Children with Speech Disorder using Automatic Speech Recognition: Opportunities and Challenges
Dancheng Liu, Jason Yang, Ishan Albrecht-Buehler, Helen Qin, Sophie Li, Yuting Hu, Amir Nassereldine, Jinjun Xiong
Comments: AAAI-FSS 24
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
[59] arXiv:2410.12182 [pdf, html, other]
Title: Guided Speaker Embedding
Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix
Comments: Accepted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[60] arXiv:2410.12266 [pdf, html, other]
Title: FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation
Huadai Liu, Jialei Wang, Rongjie Huang, Yang Liu, Heng Lu, Zhou Zhao, Wei Xue
Comments: ACL 2025 Main
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[61] arXiv:2410.12279 [pdf, html, other]
Title: Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR
Christoph Minixhofer, Ondrej Klejch, Peter Bell
Comments: Under review at ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[62] arXiv:2410.12359 [pdf, html, other]
Title: ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs
Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Yang Ai, Zhen-Hua Ling
Subjects: Audio and Speech Processing (eess.AS)
[63] arXiv:2410.12536 [pdf, other]
Title: SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model
Jianwei Cui, Yu Gu, Chao Weng, Jie Zhang, Liping Chen, Lirong Dai
Comments: Accepted by ICASSP 2024, Synthesized audio samples are available at: this https URL
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[64] arXiv:2410.12567 [pdf, html, other]
Title: SeQuiFi: Mitigating Catastrophic Forgetting in Speech Emotion Recognition with Sequential Class-Finetuning
Sarthak Jain, Orchid Chetia Phukan, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[65] arXiv:2410.12645 [pdf, html, other]
Title: Beyond Speech and More: Investigating the Emergent Ability of Speech Foundation Models for Classifying Physiological Time-Series Signals
Orchid Chetia Phukan, Swarup Ranjan Behera, Girish, Mohd Mujtaba Akhtar, Arun Balaji Buduru, Rajesh Sharma
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[66] arXiv:2410.12675 [pdf, html, other]
Title: AttentiveMOS: A Lightweight Attention-Only Model for Speech Quality Prediction
Imran E Kibria, Donald S. Williamson
Comments: Submitted to Interspeech
Subjects: Audio and Speech Processing (eess.AS)
[67] arXiv:2410.12885 [pdf, html, other]
Title: Exploiting Longitudinal Speech Sessions via Voice Assistant Systems for Early Detection of Cognitive Decline
Kristin Qi, Jiatong Shi, Caroline Summerour, John A. Batsis, Xiaohui Liang
Comments: IEEE International Conference on E-health Networking, Application & Services
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
[68] arXiv:2410.12897 [pdf, html, other]
Title: AI-Enhanced Acoustic Analysis for Comprehensive Biodiversity Monitoring and Assessment
Kumar Srinivas Bobba, Kartheeban K, Vamsi Krishna Sai, Dinesh Bugga, Vijaya Mani Surendra Bolla
Subjects: Audio and Speech Processing (eess.AS)
[69] arXiv:2410.12947 [pdf, html, other]
Title: Multi-View Multi-Task Modeling with Speech Foundation Models for Speech Forensic Tasks
Orchid Chetia Phukan, Devyani Koshal, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[70] arXiv:2410.13182 [pdf, html, other]
Title: Using RLHF to align speech enhancement approaches to mean-opinion quality scores
Anurag Kumar, Andrew Perrault, Donald S. Williamson
Comments: Submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[71] arXiv:2410.13198 [pdf, html, other]
Title: Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li
Comments: Preprint. Under Review
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[72] arXiv:2410.13221 [pdf, html, other]
Title: Investigating Effective Speaker Property Privacy Protection in Federated Learning for Speech Emotion Recognition
Chao Tan, Sheng Li, Yang Cao, Zhao Ren, Tanja Schultz
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[73] arXiv:2410.13288 [pdf, html, other]
Title: DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis
Yu Gu, Qiushi Zhu, Guangzhi Lei, Chao Weng, Dan Su
Comments: Accepted by ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[74] arXiv:2410.13342 [pdf, html, other]
Title: DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech
Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans
Comments: Accepted in Audio Imagination workshop of NeurIPS 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[75] arXiv:2410.13357 [pdf, html, other]
Title: Enhancing Crowdsourced Audio for Text-to-Speech Models
José Giraldo, Martí Llopart-Font, Alex Peiró-Lilja, Carme Armentano-Oller, Gerard Sant, Baybars Külebi
Comments: Submitted to Iberspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[76] arXiv:2410.13385 [pdf, html, other]
Title: On the Use of Audio to Improve Dialogue Policies
Daniel Roncel, Federico Costa, Javier Hernando
Comments: IberSpeech 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[77] arXiv:2410.13411 [pdf, html, other]
Title: STCON System for the CHiME-8 Challenge
Anton Mitrofanov, Tatiana Prisyach, Tatiana Timofeeva, Sergei Novoselov, Maxim Korenevsky, Yuri Khokhlov, Artem Akulov, Alexander Anikin, Roman Khalili, Iurii Lezhenin, Aleksandr Melnikov, Dmitriy Miroshnichenko, Nikita Mamaev, Ilya Odegov, Olga Rudnitskaya, Aleksei Romanenko
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[78] arXiv:2410.13599 [pdf, html, other]
Title: GAN-Based Speech Enhancement for Low SNR Using Latent Feature Conditioning
Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel
Comments: 5 pages, 2 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[79] arXiv:2410.13620 [pdf, html, other]
Title: Align-ULCNet: Towards Low-Complexity and Robust Acoustic Echo and Noise Reduction
Shrishti Saha Shetu, Naveen Kumar Desiraju, Wolfgang Mack, Emanuël A. P. Habets
Comments: 5 pages, 4 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[80] arXiv:2410.14197 [pdf, html, other]
Title: A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages
Sujitha Sathiyamoorthy (1), N Mohana (1), Anusha Prakash (3), Hema A Murthy (1 and 2) ((1) Dept of Computer Science & Engineering, Indian Institute of Technology Madras, Chennai, India (2) Shiv Nadar University Chennai, India, (3) Independent Researcher Bengaluru, India)
Comments: Submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS)
[81] arXiv:2410.14910 [pdf, html, other]
Title: AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup
Carlos Carvalho, Alberto Abad
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[82] arXiv:2410.15078 [pdf, html, other]
Title: Independent Feature Enhanced Crossmodal Fusion for Match-Mismatch Classification of Speech Stimulus and EEG Response
Shitong Fan, Wenbo Wang, Feiyang Xiao, Shiheng Zhang, Qiaoxi Zhu, Jian Guan
Comments: Shitong Fan and Wenbo Wang contributed equally. Accepted by the International Symposium on Chinese Spoken Language Processing (ISCSLP) 2024
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[83] arXiv:2410.15764 [pdf, html, other]
Title: LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec
Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu
Comments: 5 pages, 2 figures, 3 tables. Demo page: this https URL. Accepted to Interspeech 2025
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[84] arXiv:2410.16048 [pdf, html, other]
Title: Continuous Speech Synthesis using per-token Latent Diffusion
Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel
Comments: Preprint, Under review
Subjects: Audio and Speech Processing (eess.AS)
[85] arXiv:2410.16059 [pdf, html, other]
Title: Multi-Level Speaker Representation for Target Speaker Extraction
Ke Zhang, Junjie Li, Shuai Wang, Yangjie Wei, Yi Wang, Yannan Wang, Haizhou Li
Comments: 5 pages. Submitted to ICASSP 2025. Implementation will be released at this https URL
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[86] arXiv:2410.16130 [pdf, html, other]
Title: Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning
Chun-Yi Kuan, Hung-yi Lee
Comments: Accepted to ICASSP 2025. Project Website: this https URL
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[87] arXiv:2410.16330 [pdf, html, other]
Title: End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach
Abdulhady Abas Abdullah, Shima Tabibian, Hadi Veisi, Aso Mahmudi, Tarik Rashid
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
[88] arXiv:2410.16647 [pdf, html, other]
Title: GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting
Pai Zhu, Jacob W. Bartel, Dhruuv Agarwal, Kurt Partridge, Hyun Jin Park, Quan Wang
Comments: 8 pages, 6 figures, 2 tables. Accepted at IEEE Spoken Language Technology (SLT) 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[89] arXiv:2410.16726 [pdf, html, other]
Title: Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap
Guanrou Yang, Fan Yu, Ziyang Ma, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[90] arXiv:2410.17028 [pdf, html, other]
Title: Can a Machine Distinguish High and Low Amount of Social Creak in Speech?
Anne-Maria Laukkanen, Sudarsana Reddy Kadiri, Shrikanth Narayanan, Paavo Alku
Comments: Accepted in Journal of Voice
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[91] arXiv:2410.17033 [pdf, html, other]
Title: Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification
Wen Huang, Bing Han, Zhengyang Chen, Shuai Wang, Yanmin Qian
Comments: Accepted to ISCSLP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[92] arXiv:2410.17437 [pdf, html, other]
Title: Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models
Alexander Polok, Santosh Kesiraju, Karel Beneš, Lukáš Burget, Jan Černocký
Subjects: Audio and Speech Processing (eess.AS)
[93] arXiv:2410.17790 [pdf, other]
Title: Regularized autoregressive modeling and its application to audio signal declipping
Ondřej Mokrý, Pavel Rajmic
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[94] arXiv:2410.17834 [pdf, html, other]
Title: Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean Speech
Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann
Comments: Accepted at Interspeech 2025
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[95] arXiv:2410.18908 [pdf, html, other]
Title: A Survey on Speech Large Language Models for Understanding
Jing Peng, Yucheng Wang, Bohan Li, Yiwei Guo, Hankun Wang, Yangui Fang, Yu Xi, Haoyu Li, Xu Li, Ke Zhang, Shuai Wang, Kai Yu
Comments: This paper is submitted as an invited overview to IEEE JSTSP
Subjects: Audio and Speech Processing (eess.AS)
[96] arXiv:2410.19168 [pdf, html, other]
Title: MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
Comments: Project Website: this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[97] arXiv:2410.19595 [pdf, html, other]
Title: Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint Localization and Mask Estimation
Jakob Kienegger, Alina Mannanova, Timo Gerkmann
Comments: ©2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[98] arXiv:2410.20095 [pdf, html, other]
Title: Analyzing long-term rhythm variations in Mising and Assamese using frequency domain correlates
Parismita Gogoi, Priyankoo Sarmah, S. R. M. Prasanna
Comments: Submitted to International Journal of Asian Language Processing (IJALP)
Subjects: Audio and Speech Processing (eess.AS)
[99] arXiv:2410.20578 [pdf, html, other]
Title: Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes
Ivan Kukanov, Janne Laakkonen, Tomi Kinnunen, Ville Hautamäki
Comments: 6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[100] arXiv:2410.21455 [pdf, html, other]
Title: Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models
Tobias Cord-Landwehr, Christoph Boeddeker, Reinhold Haeb-Umbach
Comments: Accepted at ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS)