Audio and Speech Processing

Authors and titles for October 2024

Total of 358 entries : 1-100 101-200 201-300 301-358

[1] arXiv:2410.00035 [pdf, html, other]: Title: FeruzaSpeech: A 60 Hour Uzbek Read Speech Corpus with Punctuation, Casing, and Context

Anna Povey, Katherine Povey

Comments: 5 Pages, 1 Figure, Preprint of Paper Accepted in ICNLSP 2024

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
[2] arXiv:2410.00037 [pdf, html, other]: Title: Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[3] arXiv:2410.00070 [pdf, html, other]: Title: Mamba for Streaming ASR Combined with Unimodal Aggregation

Ying Fang, Xiaofei Li

Comments: Accepted by ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[4] arXiv:2410.00390 [pdf, html, other]: Title: Multi-Scale Temporal Transformer For Speech Emotion Recognition

Zhipeng Li, Xiaofen Xing, Yuanbo Fang, Weibin Zhang, Hengsheng Fan, Xiangmin Xu

Subjects: Audio and Speech Processing (eess.AS)
[5] arXiv:2410.00511 [pdf, html, other]: Title: Pre-training with Synthetic Patterns for Audio

Yuchi Ishikawa, Tatsuya Komatsu, Yoshimitsu Aoki

Comments: Submitted to ICASSP'25

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[6] arXiv:2410.00527 [pdf, html, other]: Title: Wanna hear your voice? A sample is all we need!

The Hieu Pham, Phuong Thanh Tran Nguyen, Xuan Tho Nguyen, Tan Dat Nguyen, Duc Dung Nguyen

Comments: work in progress

Subjects: Audio and Speech Processing (eess.AS)
[7] arXiv:2410.00528 [pdf, other]: Title: End-to-End Speech Recognition with Pre-trained Masked Language Model

Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

Subjects: Audio and Speech Processing (eess.AS)
[8] arXiv:2410.00680 [pdf, html, other]: Title: The Conformer Encoder May Reverse the Time Dimension

Robin Schmitt, Albert Zeyer, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney

Comments: Accepted at ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Machine Learning (stat.ML)
[9] arXiv:2410.01108 [pdf, html, other]: Title: Augmentation through Laundering Attacks for Audio Spoof Detection

Hashim Ali, Surya Subramani, Hafiz Malik

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[10] arXiv:2410.01150 [pdf, html, other]: Title: Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules

Hsin-Tien Chiang, Hao Zhang, Yong Xu, Meng Yu, Dong Yu

Comments: Paper in submission

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[11] arXiv:2410.01162 [pdf, html, other]: Title: Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli

Comments: Accepted to Interspeech 2025

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[12] arXiv:2410.01562 [pdf, html, other]: Title: HRTF Estimation using a Score-based Prior

Etienne Thuillier, Jean-Marie Lemercier, Eloi Moliner, Timo Gerkmann, Vesa Välimäki

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[13] arXiv:2410.01841 [pdf, other]: Title: A GEN AI Framework for Medical Note Generation

Hui Yi Leong, Yi Fan Gao, Shuai Ji, Bora Kalaycioglu, Uktu Pamuksuz

Comments: 8 Figures, 7 pages, IEEE standard research paper

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Sound (cs.SD)
[14] arXiv:2410.02056 [pdf, html, other]: Title: Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha

Comments: Accepted at ICLR 2025. Code and Checkpoints available here: this https URL

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[15] arXiv:2410.02364 [pdf, html, other]: Title: State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data

Sara Barahona, Ladislav Mošner, Themos Stafylakis, Oldřich Plchot, Junyi Peng, Lukáš Burget, Jan Černocký

Comments: This work has been submitted to the IEEE for possible publication

Subjects: Audio and Speech Processing (eess.AS)
[16] arXiv:2410.02371 [pdf, html, other]: Title: NTU-NPU System for Voice Privacy 2024 Challenge

Nikita Kuzmin, Hieu-Thi Luong, Jixun Yao, Lei Xie, Kong Aik Lee, Eng Siong Chng

Comments: System description for VPC 2024

Journal-ref: 2024 Challenge. Proc. 4th Symposium on Security and Privacy in Speech Communication, 72-79

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[17] arXiv:2410.03007 [pdf, html, other]: Title: FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model

Yichen Lu, Jiaqi Song, Chao-Han Huck Yang, Shinji Watanabe

Comments: EMNLP 2024 Industry Track

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[18] arXiv:2410.03139 [pdf, html, other]: Title: How does the teacher rate? Observations from the NeuroPiano dataset

Huan Zhang, Vincent Cheung, Hayato Nishioka, Simon Dixon, Shinichi Furuya

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[19] arXiv:2410.03192 [pdf, html, other]: Title: MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo

Comments: Accepted to EMNLP 2024 Findings

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[20] arXiv:2410.03280 [pdf, other]: Title: Manikin-Recorded Cardiopulmonary Sounds Dataset Using Digital Stethoscope

Yasaman Torabi, Shahram Shirani, James P. Reilly

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
[21] arXiv:2410.03298 [pdf, html, other]: Title: Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens

Jinzheng Zhao, Niko Moritz, Egor Lakomkin, Ruiming Xie, Zhiping Xiu, Katerina Zmolikova, Zeeshan Ahmed, Yashesh Gaur, Duc Le, Christian Fuegen

Comments: Submitted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS)
[22] arXiv:2410.04017 [pdf, html, other]: Title: Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System

Ze Li, Yao Shi, Yunfei Xu, Ming Li

Subjects: Audio and Speech Processing (eess.AS)
[23] arXiv:2410.04092 [pdf, html, other]: Title: Enhancement of Dysarthric Speech Reconstruction by Contrastive Learning

Keshvari Fatemeh, Mahdian Toroghi Rahil, Zareian Hassan

Subjects: Audio and Speech Processing (eess.AS)
[24] arXiv:2410.04198 [pdf, html, other]: Title: DJ Mix Transcription with Multi-Pass Non-Negative Matrix Factorization

Étienne Paul André, Dominique Fourer, Diemo Schwarz

Comments: Submitted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[25] arXiv:2410.04380 [pdf, html, other]: Title: HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis

Yuto Nishimura, Takumi Hirose, Masanari Ohi, Hideki Nakayama, Nakamasa Inoue

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[26] arXiv:2410.04690 [pdf, html, other]: Title: SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Minchan Kim, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim

Comments: This work has been submitted to the IEEE for possible publication

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
[27] arXiv:2410.04785 [pdf, html, other]: Title: Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet

Xiang Hao, Chenxiang Ma, Qu Yang, Jibin Wu, Kay Chen Tan

Comments: under review

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[28] arXiv:2410.04951 [pdf, html, other]: Title: A decade of DCASE: Achievements, practices, evaluations and future challenges

Annamaria Mesaros, Romain Serizel, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley

Comments: Submitted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[29] arXiv:2410.05101 [pdf, html, other]: Title: CR-CTC: Consistency regularization on CTC for improved speech recognition

Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

Comments: Published as a conference paper at ICLR 2025

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[30] arXiv:2410.05151 [pdf, html, other]: Title: Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang

Comments: Accepted for publication at ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[31] arXiv:2410.05302 [pdf, other]: Title: Episodic fine-tuning prototypical networks for optimization-based few-shot learning: Application to audio classification

Xuanyu Zhuang (LTCI, IP Paris, S2A, IDS), Geoffroy Peeters (LTCI, IP Paris, S2A, IDS), Gaël Richard (S2A, IDS, LTCI, IP Paris)

Comments: Accepted at MLSP 2024

Journal-ref: 2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2024), Sep 2024, London (UK), United Kingdom

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Signal Processing (eess.SP)
[32] arXiv:2410.05320 [pdf, html, other]: Title: The OCON model: an old but gold solution for distributable supervised classification

Stefano Giacomelli, Marco Giordano, Claudia Rinaldi

Comments: Accepted at "2024 29th IEEE Symposium on Computers and Communications (ISCC): workshop on Next-Generation Multimedia Services at the Edge: Leveraging 5G and Beyond (NGMSE2024)". arXiv admin note: text overlap with arXiv:2410.04098

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG); Sound (cs.SD)
[33] arXiv:2410.05620 [pdf, html, other]: Title: Improving Data Augmentation-based Cross-Speaker Style Transfer for TTS with Singing Voice, Style Filtering, and F0 Matching

Leonardo B. de M. M. Marques, Lucas H. Ueda, Mário U. Neto, Flávio O. Simões, Fernando Runstein, Bianca Dal Bó, Paula D. P. Costa

Comments: Submitted to INTERSPEECH 2024

Subjects: Audio and Speech Processing (eess.AS)
[34] arXiv:2410.05724 [pdf, html, other]: Title: Exploring rhythm formant analysis for Indic language classification

Parismita Gogoi, Sishir Kalita, Priyankoo Sarmah, S. R. Mahadeva Prasanna

Comments: Submitted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[35] arXiv:2410.05986 [pdf, html, other]: Title: The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge

Ya Jiang, Hongbo Lan, Jun Du, Qing Wang, Shutong Niu

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[36] arXiv:2410.05997 [pdf, html, other]: Title: An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Hugo Malard, Michel Olvera, Stéphane Lathuiliere, Slim Essid

Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[37] arXiv:2410.06670 [pdf, html, other]: Title: LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction

Di Liang, Xiaofei Li

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[38] arXiv:2410.06787 [pdf, html, other]: Title: Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch

Teodora Răgman, Adriana Stan

Comments: Accepted at 2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP 2024)

Subjects: Audio and Speech Processing (eess.AS)
[39] arXiv:2410.06885 [pdf, html, other]: Title: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen

Comments: 17 pages, 9 tables, 3 figures

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[40] arXiv:2410.07277 [pdf, html, other]: Title: Swin-BERT: A Feature Fusion System designed for Speech-based Alzheimer's Dementia Detection

Yilin Pan, Yanpei Shi, Yijia Zhang, Mingyu Lu

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[41] arXiv:2410.07379 [pdf, html, other]: Title: Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge

Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj

Comments: Accepted into ASVspoof5 workshop

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[42] arXiv:2410.07428 [pdf, html, other]: Title: The First VoicePrivacy Attacker Challenge Evaluation Plan

Natalia Tomashenko, Xiaoxiao Miao, Emmanuel Vincent, Junichi Yamagishi

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
[43] arXiv:2410.07935 [pdf, html, other]: Title: Robust Fixed-Filter Sound Zone Control with Audio-Based Position Tracking

Sankha Subhra Bhattacharjee, Andreas Jonas Fuglsig, Flemming Christensen, Jesper Rindom Jensen, Mads Græsbøll Christensen

Comments: Equal contribution by Sankha Subhra Bhattacharjee and Andreas Jonas Fuglsig. Accepted at ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[44] arXiv:2410.07978 [pdf, html, other]: Title: Sound Zone Control Robust To Sound Speed Change

Sankha Subhra Bhattacharjee, Jesper Rindom Jensen, Mads Græsbøll Christensen

Comments: 5 pages, 4 figures, submitted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[45] arXiv:2410.07982 [pdf, html, other]: Title: Window Function-less DFT with Reduced Noise and Latency for Real-Time Music Analysis

Cai Biesinger, Hiromitsu Awano, Masanori Hashimoto

Comments: 5 pages, 4 figures, Submitted to EUSIPCO 2025. TeX-generated PDF exemption due to formatting problems on arXiv. This version: clarified text throughout, updated data after further optimization work, added more comparisons and a table, added references

Subjects: Audio and Speech Processing (eess.AS)
[46] arXiv:2410.08250 [pdf, html, other]: Title: Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis

Tuan Nguyen, Corinne Fredouille, Alain Ghio, Mathieu Balaguer, Virginie Woisard

Comments: Accepted at the Spoken Language Technology (SLT) Conference 2024

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[47] arXiv:2410.08325 [pdf, html, other]: Title: Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer

Slava Shechtman, Avihu Dekel

Comments: You can download the model from this https URL

Journal-ref: Proc. Interspeech 2024, 4174-4178

Subjects: Audio and Speech Processing (eess.AS)
[48] arXiv:2410.08919 [pdf, html, other]: Title: Low-complexity Attention-based Unsupervised Anomalous Sound Detection exploiting Separable Convolutions and Angular Loss

Michael Neri, Marco Carli

Comments: Accepted for publication in IEEE Sensors Letters. 4 pages, 4 figures

Subjects: Audio and Speech Processing (eess.AS)
[49] arXiv:2410.09236 [pdf, other]: Title: Enhancing Infant Crying Detection with Gradient Boosting for Improved Emotional and Mental Health Diagnostics

Kyunghun Lee, Lauren M. Henry, Eleanor Hansen, Elizabeth Tandilashvili, Lauren S. Wakschlag, Elizabeth Norton, Daniel S. Pine, Melissa A. Brotman, Francisco Pereira

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[50] arXiv:2410.09503 [pdf, html, other]: Title: SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Wenxi Chen, Ziyang Ma, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[51] arXiv:2410.09636 [pdf, html, other]: Title: Can We Estimate Purchase Intention Based on Zero-shot Speech Emotion Recognition?

Ryotaro Nagase, Takashi Sumiyoshi, Natsuo Yamashita, Kota Dohi, Yohei Kawaguchi

Comments: 5 pages, 3 figures, accepted for APSIPA 2024 ASC

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[52] arXiv:2410.10434 [pdf, other]: Title: In-Materia Speech Recognition

Mohamadreza Zolfagharinejad, Julian Büchel, Lorenzo Cassola, Sachin Kinge, Ghazi Sarwat Syed, Abu Sebastian, Wilfred G. van der Wiel

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[53] arXiv:2410.11025 [pdf, html, other]: Title: Code Drift: Towards Idempotent Neural Audio Codecs

Patrick O'Reilly, Prem Seetharaman, Jiaqi Su, Zeyu Jin, Bryan Pardo

Comments: ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[54] arXiv:2410.11097 [pdf, html, other]: Title: DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[55] arXiv:2410.11181 [pdf, html, other]: Title: DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection

Sheng Yan, Cunhang Fan, Hongyu Zhang, Xiaoke Yang, Jianhua Tao, Zhao Lv

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
[56] arXiv:2410.11190 [pdf, html, other]: Title: Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Zhifei Xie, Changqiao Wu

Comments: Technical report, work in progress. Demo and code: this https URL

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[57] arXiv:2410.11453 [pdf, other]: Title: The importance of spatial and spectral information in multiple speaker tracking

Hanan Beit-On, Vladimir Tourbabin, Boaz Rafaely

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[58] arXiv:2410.11865 [pdf, html, other]: Title: Automatic Screening for Children with Speech Disorder using Automatic Speech Recognition: Opportunities and Challenges

Dancheng Liu, Jason Yang, Ishan Albrecht-Buehler, Helen Qin, Sophie Li, Yuting Hu, Amir Nassereldine, Jinjun Xiong

Comments: AAAI-FSS 24

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
[59] arXiv:2410.12182 [pdf, html, other]: Title: Guided Speaker Embedding

Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

Comments: Accepted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[60] arXiv:2410.12266 [pdf, html, other]: Title: FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

Huadai Liu, Jialei Wang, Rongjie Huang, Yang Liu, Heng Lu, Zhou Zhao, Wei Xue

Comments: ACL 2025 Main

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[61] arXiv:2410.12279 [pdf, html, other]: Title: Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR

Christoph Minixhofer, Ondrej Klejch, Peter Bell

Comments: Under review at ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[62] arXiv:2410.12359 [pdf, html, other]: Title: ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Yang Ai, Zhen-Hua Ling

Subjects: Audio and Speech Processing (eess.AS)
[63] arXiv:2410.12536 [pdf, other]: Title: SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model

Jianwei Cui, Yu Gu, Chao Weng, Jie Zhang, Liping Chen, Lirong Dai

Comments: Accepted by ICASSP 2024, Synthesized audio samples are available at: this https URL

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[64] arXiv:2410.12567 [pdf, html, other]: Title: SeQuiFi: Mitigating Catastrophic Forgetting in Speech Emotion Recognition with Sequential Class-Finetuning

Sarthak Jain, Orchid Chetia Phukan, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[65] arXiv:2410.12645 [pdf, html, other]: Title: Beyond Speech and More: Investigating the Emergent Ability of Speech Foundation Models for Classifying Physiological Time-Series Signals

Orchid Chetia Phukan, Swarup Ranjan Behera, Girish, Mohd Mujtaba Akhtar, Arun Balaji Buduru, Rajesh Sharma

Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[66] arXiv:2410.12675 [pdf, html, other]: Title: AttentiveMOS: A Lightweight Attention-Only Model for Speech Quality Prediction

Imran E Kibria, Donald S. Williamson

Comments: Submitted to Interspeech

Subjects: Audio and Speech Processing (eess.AS)
[67] arXiv:2410.12885 [pdf, html, other]: Title: Exploiting Longitudinal Speech Sessions via Voice Assistant Systems for Early Detection of Cognitive Decline

Kristin Qi, Jiatong Shi, Caroline Summerour, John A. Batsis, Xiaohui Liang

Comments: IEEE International Conference on E-health Networking, Application & Services

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
[68] arXiv:2410.12897 [pdf, html, other]: Title: AI-Enhanced Acoustic Analysis for Comprehensive Biodiversity Monitoring and Assessment

Kumar Srinivas Bobba, Kartheeban K, Vamsi Krishna Sai, Dinesh Bugga, Vijaya Mani Surendra Bolla

Subjects: Audio and Speech Processing (eess.AS)
[69] arXiv:2410.12947 [pdf, html, other]: Title: Multi-View Multi-Task Modeling with Speech Foundation Models for Speech Forensic Tasks

Orchid Chetia Phukan, Devyani Koshal, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[70] arXiv:2410.13182 [pdf, html, other]: Title: Using RLHF to align speech enhancement approaches to mean-opinion quality scores

Anurag Kumar, Andrew Perrault, Donald S. Williamson

Comments: Submitted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[71] arXiv:2410.13198 [pdf, html, other]: Title: Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation

Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li

Comments: Preprint. Under Review

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[72] arXiv:2410.13221 [pdf, html, other]: Title: Investigating Effective Speaker Property Privacy Protection in Federated Learning for Speech Emotion Recognition

Chao Tan, Sheng Li, Yang Cao, Zhao Ren, Tanja Schultz

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[73] arXiv:2410.13288 [pdf, html, other]: Title: DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis

Yu Gu, Qiushi Zhu, Guangzhi Lei, Chao Weng, Dan Su

Comments: Accepted by ICASSP2024

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[74] arXiv:2410.13342 [pdf, html, other]: Title: DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech

Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

Comments: Accepted in Audio Imagination workshop of NeurIPS 2024

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[75] arXiv:2410.13357 [pdf, html, other]: Title: Enhancing Crowdsourced Audio for Text-to-Speech Models

José Giraldo, Martí Llopart-Font, Alex Peiró-Lilja, Carme Armentano-Oller, Gerard Sant, Baybars Külebi

Comments: Submitted to Iberspeech 2024

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[76] arXiv:2410.13385 [pdf, html, other]: Title: On the Use of Audio to Improve Dialogue Policies

Daniel Roncel, Federico Costa, Javier Hernando

Comments: IberSpeech 2024

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[77] arXiv:2410.13411 [pdf, html, other]: Title: STCON System for the CHiME-8 Challenge

Anton Mitrofanov, Tatiana Prisyach, Tatiana Timofeeva, Sergei Novoselov, Maxim Korenevsky, Yuri Khokhlov, Artem Akulov, Alexander Anikin, Roman Khalili, Iurii Lezhenin, Aleksandr Melnikov, Dmitriy Miroshnichenko, Nikita Mamaev, Ilya Odegov, Olga Rudnitskaya, Aleksei Romanenko

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[78] arXiv:2410.13599 [pdf, html, other]: Title: GAN-Based Speech Enhancement for Low SNR Using Latent Feature Conditioning

Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel

Comments: 5 pages, 2 figures

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[79] arXiv:2410.13620 [pdf, html, other]: Title: Align-ULCNet: Towards Low-Complexity and Robust Acoustic Echo and Noise Reduction

Shrishti Saha Shetu, Naveen Kumar Desiraju, Wolfgang Mack, Emanuël A. P. Habets

Comments: 5 pages, 4 figures

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[80] arXiv:2410.14197 [pdf, html, other]: Title: A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages

Sujitha Sathiyamoorthy (1), N Mohana (1), Anusha Prakash (3), Hema A Murthy (1 and 2) ((1) Dept of Computer Science & Engineering, Indian Institute of Technology Madras, Chennai, India (2) Shiv Nadar University Chennai, India, (3) Independent Researcher Bengaluru, India)

Comments: Submitted to ICASSP 2025

Subjects: Audio and Speech Processing (eess.AS)
[81] arXiv:2410.14910 [pdf, html, other]: Title: AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup

Carlos Carvalho, Alberto Abad

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[82] arXiv:2410.15078 [pdf, html, other]: Title: Independent Feature Enhanced Crossmodal Fusion for Match-Mismatch Classification of Speech Stimulus and EEG Response

Shitong Fan, Wenbo Wang, Feiyang Xiao, Shiheng Zhang, Qiaoxi Zhu, Jian Guan

Comments: Shitong Fan and Wenbo Wang contributed equally. Accepted by the International Symposium on Chinese Spoken Language Processing (ISCSLP) 2024

Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[83] arXiv:2410.15764 [pdf, html, other]: Title: LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu

Comments: 5 pages, 2 figures, 3 tables. Demo page: this https URL. Accepted to Interspeech 2025

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[84] arXiv:2410.16048 [pdf, html, other]: Title: Continuous Speech Synthesis using per-token Latent Diffusion

Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel

Comments: Preprint, Under review

Subjects: Audio and Speech Processing (eess.AS)
[85] arXiv:2410.16059 [pdf, html, other]: Title: Multi-Level Speaker Representation for Target Speaker Extraction

Ke Zhang, Junjie Li, Shuai Wang, Yangjie Wei, Yi Wang, Yannan Wang, Haizhou Li

Comments: 5 pages. Submitted to ICASSP 2025. Implementation will be released at this https URL

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[86] arXiv:2410.16130 [pdf, html, other]: Title: Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

Chun-Yi Kuan, Hung-yi Lee

Comments: Accepted to ICASSP 2025. Project Website: this https URL

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[87] arXiv:2410.16330 [pdf, html, other]: Title: End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach

Abdulhady Abas Abdullah, Shima Tabibian, Hadi Veisi, Aso Mahmudi, Tarik Rashid

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
[88] arXiv:2410.16647 [pdf, html, other]: Title: GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting

Pai Zhu, Jacob W. Bartel, Dhruuv Agarwal, Kurt Partridge, Hyun Jin Park, Quan Wang

Comments: 8 pages, 6 figures, 2 tables. The paper is accepted at IEEE Spoken Language Technology (SLT) 2024

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[89] arXiv:2410.16726 [pdf, html, other]: Title: Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Guanrou Yang, Fan Yu, Ziyang Ma, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[90] arXiv:2410.17028 [pdf, html, other]: Title: Can a Machine Distinguish High and Low Amount of Social Creak in Speech?

Anne-Maria Laukkanen, Sudarsana Reddy Kadiri, Shrikanth Narayanan, Paavo Alku

Comments: Accepted in Journal of Voice

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[91] arXiv:2410.17033 [pdf, html, other]: Title: Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification

Wen Huang, Bing Han, Zhengyang Chen, Shuai Wang, Yanmin Qian

Comments: Accepted to ISCSLP 2024

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[92] arXiv:2410.17437 [pdf, html, other]: Title: Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models

Alexander Polok, Santosh Kesiraju, Karel Beneš, Lukáš Burget, Jan Černocký

Subjects: Audio and Speech Processing (eess.AS)
[93] arXiv:2410.17790 [pdf, other]: Title: Regularized autoregressive modeling and its application to audio signal declipping

Ondřej Mokrý, Pavel Rajmic

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[94] arXiv:2410.17834 [pdf, html, other]: Title: Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean Speech

Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann

Comments: Accepted at Interspeech 2025

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[95] arXiv:2410.18908 [pdf, html, other]: Title: A Survey on Speech Large Language Models for Understanding

Jing Peng, Yucheng Wang, Bohan Li, Yiwei Guo, Hankun Wang, Yangui Fang, Yu Xi, Haoyu Li, Xu Li, Ke Zhang, Shuai Wang, Kai Yu

Comments: This paper is submitted as an invited overview to IEEE JSTSP

Subjects: Audio and Speech Processing (eess.AS)
[96] arXiv:2410.19168 [pdf, html, other]: Title: MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha

Comments: Project Website: this https URL

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[97] arXiv:2410.19595 [pdf, html, other]: Title: Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint Localization and Mask Estimation

Jakob Kienegger, Alina Mannanova, Timo Gerkmann

Comments: ©2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[98] arXiv:2410.20095 [pdf, html, other]: Title: Analyzing long-term rhythm variations in Mising and Assamese using frequency domain correlates

Parismita Gogoi, Priyankoo Sarmah, S. R. M. Prasanna

Comments: Submitted to International Journal of Asian Language Processing (IJALP)

Subjects: Audio and Speech Processing (eess.AS)
[99] arXiv:2410.20578 [pdf, html, other]: Title: Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes

Ivan Kukanov, Janne Laakkonen, Tomi Kinnunen, Ville Hautamäki

Comments: 6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[100] arXiv:2410.21455 [pdf, html, other]: Title: Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models

Tobias Cord-Landwehr, Christoph Boeddeker, Reinhold Haeb-Umbach

Comments: Accepted at ICASSP2025

Subjects: Audio and Speech Processing (eess.AS)
