Audio and Speech Processing

Authors and titles for March 2024

Total of 213 entries (entries 151-200 shown)

[151] arXiv:2403.10961 (cross-list from cs.LG) [pdf, html, other]: Title: Energy-Based Models with Applications to Speech and Language Processing

Zhijian Ou

Comments: The version before publisher editing

Journal-ref: Foundations and Trends in Signal Processing: Vol. 18: No. 1-2, pp 1-199

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[152] arXiv:2403.11074 (cross-list from cs.CV) [pdf, html, other]: Title: Audio-Visual Segmentation via Unlabeled Frame Exploitation

Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang

Comments: Accepted by CVPR 2024

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[153] arXiv:2403.11091 (cross-list from cs.SD) [pdf, html, other]: Title: Multitask frame-level learning for few-shot sound event detection

Liang Zou, Genwei Yan, Ruoyu Wang, Jun Du, Meng Lei, Tian Gao, Xin Fang

Comments: 6 pages, 4 figures, conference

Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[154] arXiv:2403.11626 (cross-list from cs.GR) [pdf, html, other]: Title: QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation

Zhizhen Zhou, Yejing Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li

Comments: Accepted by The Visual Computer Journal

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[155] arXiv:2403.11706 (cross-list from cs.SD) [pdf, html, other]: Title: Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models

Emilian Postolache, Giorgio Mariani, Luca Cosmo, Emmanouil Benetos, Emanuele Rodolà

Comments: Accepted at ICASSP 2024

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[156] arXiv:2403.11732 (cross-list from cs.SD) [pdf, html, other]: Title: Hallucination in Perceptual Metric-Driven Speech Enhancement Networks

George Close, Thomas Hain, Stefan Goetze

Comments: Accepted for EUSIPCO 2024

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[157] arXiv:2403.11757 (cross-list from cs.MM) [pdf, html, other]: Title: Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation

Jun Yu, Wangyuan Zhu, Jichao Zhu

Subjects: Multimedia (cs.MM); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[158] arXiv:2403.11778 (cross-list from cs.SD) [pdf, html, other]: Title: Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms

Jonat John Mathew, Rakin Ahsan, Sae Furukawa, Jagdish Gautham Krishna Kumar, Huzaifa Pallan, Agamjeet Singh Padda, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan

Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[159] arXiv:2403.11780 (cross-list from cs.SD) [pdf, html, other]: Title: Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao

Comments: Accepted by NAACL 2024 (main conference)

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[160] arXiv:2403.11827 (cross-list from cs.SD) [pdf, html, other]: Title: Sound Event Detection and Localization with Distance Estimation

Daniel Aleksander Krause, Archontis Politis, Annamaria Mesaros

Comments: This paper has been accepted for the 32nd European Signal Processing Conference EUSIPCO 2024 in Lyon

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[161] arXiv:2403.11879 (cross-list from cs.SD) [pdf, html, other]: Title: Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

Tobias Hallmen, Fabian Deuser, Norbert Oswald, Elisabeth André

Journal-ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4657-4665

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[162] arXiv:2403.12000 (cross-list from cs.SD) [pdf, html, other]: Title: Notochord: a Flexible Probabilistic Model for Real-Time MIDI Performance

Victor Shepardson, Jack Armitage, Thor Magnusson

Comments: 12 pages, 6 figures. Proceedings of the 3rd Conference on AI Music Creativity (2022, September 17)

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[163] arXiv:2403.12402 (cross-list from cs.CL) [pdf, html, other]: Title: An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis

Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[164] arXiv:2403.12408 (cross-list from cs.CL) [pdf, html, other]: Title: MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[165] arXiv:2403.12425 (cross-list from cs.CV) [pdf, html, other]: Title: Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation

Jun Yu, Gongpeng Zhao, Yongqi Wang, Zhihong Wei, Yang Zheng, Zerui Zhang, Zhongpeng Cai, Guochen Xie, Jichao Zhu, Wangyuan Zhu

Comments: 8 pages, 3 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[166] arXiv:2403.12477 (cross-list from cs.SD) [pdf, html, other]: Title: Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation

Yuto Ishikawa, Kohei Konaka, Tomohiko Nakamura, Norihiro Takamune, Hiroshi Saruwatari

Comments: 5 pages, 3 figures, accepted at HSCMA 2024

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[167] arXiv:2403.13086 (cross-list from cs.SD) [pdf, html, other]: Title: Listenable Maps for Audio Classifiers

Francesco Paissan, Mirco Ravanelli, Cem Subakan

Comments: Accepted to ICML 2024 (Oral)

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[168] arXiv:2403.13252 (cross-list from cs.SD) [pdf, html, other]: Title: Frequency-aware convolution for sound event detection

Tao Song, WenWen Zhang

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[169] arXiv:2403.13253 (cross-list from cs.CL) [pdf, html, other]: Title: Document Author Classification Using Parsed Language Structure

Todd K Moon, Jacob H. Gunther

Journal-ref: International Journal on Natural Language Computing (IJNLC), Feb. 24, 2024

Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[170] arXiv:2403.13254 (cross-list from cs.SD) [pdf, html, other]: Title: Onset and offset weighted loss function for sound event detection

Tao Song

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[171] arXiv:2403.13353 (cross-list from cs.SD) [pdf, html, other]: Title: Building speech corpus with diverse voice characteristics for its prompt-based representation

Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. arXiv admin note: text overlap with arXiv:2309.13509

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[172] arXiv:2403.13423 (cross-list from cs.SD) [pdf, html, other]: Title: Advanced Long-Content Speech Recognition With Factorized Neural Transducer

Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

Comments: Accepted by TASLP 2024

Journal-ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1803-1815, 2024

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[173] arXiv:2403.13659 (cross-list from cs.CV) [pdf, html, other]: Title: Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition

R. Gnana Praveen, Jahangir Alam

Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[174] arXiv:2403.13720 (cross-list from cs.SD) [pdf, html, other]: Title: UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito

Comments: 5 pages, 3 figures

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[175] arXiv:2403.13922 (cross-list from cs.CL) [pdf, html, other]: Title: Visually Grounded Speech Models have a Mutual Exclusivity Bias

Leanne Nortje, Dan Oneaţă, Yevgen Matusevych, Herman Kamper

Comments: Accepted to TACL, pre-MIT Press publication version

Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[176] arXiv:2403.14048 (cross-list from cs.SD) [pdf, html, other]: Title: The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data

Alice Baird, Rachel Manzelli, Panagiotis Tzirakis, Chris Gagne, Haoqi Li, Sadie Allen, Sander Dieleman, Brian Kulis, Shrikanth S. Narayanan, Alan Cowen

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[177] arXiv:2403.14083 (cross-list from cs.SD) [pdf, html, other]: Title: emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition

Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Bjorn W. Schuller, Carlos Busso

Comments: Submitted to IEEE Transactions on Affective Computing on February 19, 2024. arXiv admin note: text overlap with arXiv:2305.14402

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[178] arXiv:2403.14286 (cross-list from cs.SD) [pdf, html, other]: Title: Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization

Nikhil Raghav, Md Sahidullah

Comments: Manuscript Under Review

Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[179] arXiv:2403.14290 (cross-list from cs.SD) [pdf, html, other]: Title: Exploring Green AI for Audio Deepfake Detection

Subhajit Saha, Md Sahidullah, Swagatam Das

Comments: This manuscript is under review in a conference

Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[180] arXiv:2403.14402 (cross-list from cs.SD) [pdf, html, other]: Title: XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

Comments: ACL2024

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[181] arXiv:2403.14438 (cross-list from cs.CL) [pdf, html, other]: Title: A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

Comments: arXiv admin note: text overlap with arXiv:2312.03632

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[182] arXiv:2403.15469 (cross-list from cs.CL) [pdf, html, other]: Title: Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

Shivam Ratnakant Mhaskar, Nirmesh J. Shah, Mohammadi Zaki, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv Ratn Shah

Comments: Accepted in NAACL2024 Findings

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[183] arXiv:2403.15510 (cross-list from cs.CR) [pdf, html, other]: Title: Privacy-Preserving End-to-End Spoken Language Understanding

Yinggui Wang, Wei Huang, Le Yang

Comments: Accepted by IJCAI

Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[184] arXiv:2403.15523 (cross-list from q-bio.NC) [pdf, html, other]: Title: Towards auditory attention decoding with noise-tagging: A pilot study

H. A. Scheppink, S. Ahmadi, P. Desain, M. Tangermann, J. Thielen

Comments: 6 pages, 2 figures, 9th Graz Brain-Computer Interface Conference 2024

Journal-ref: 9th Graz Brain-Computer Interface Conference (2024) 337-342

Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[185] arXiv:2403.15569 (cross-list from cs.SD) [pdf, other]: Title: Music to Dance as Language Translation using Sequence Models

André Correia, Luís A. Alexandre

Subjects: Sound (cs.SD); Robotics (cs.RO); Audio and Speech Processing (eess.AS)
[186] arXiv:2403.16078 (cross-list from cs.SD) [pdf, html, other]: Title: Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng

Comments: Accepted by IJCNN 2024

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[187] arXiv:2403.16331 (cross-list from cs.SD) [pdf, html, other]: Title: Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models

Hanzhi Yin, Gang Cheng, Christian J. Steinmetz, Ruibin Yuan, Richard M. Stern, Roger B. Dannenberg

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[188] arXiv:2403.16464 (cross-list from cs.SD) [pdf, html, other]: Title: Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka

Comments: Accepted to ICASSP 2024. Project page: this https URL

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[189] arXiv:2403.16760 (cross-list from cs.HC) [pdf, other]: Title: As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli

Di Cooke, Abigail Edwards, Sophia Barkoff, Kathryn Kelly

Comments: For study pre-registration, see this https URL V5: expanded on ecological validation in Introduction; revised table in Results to add OR & OR CI, previous data unchanged; added further details on study design in Methods; added Appendix with survey screenshots; migrated list of dataset sources from footnotes into references

Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[190] arXiv:2403.16865 (cross-list from cs.CL) [pdf, html, other]: Title: Encoding of lexical tone in self-supervised models of spoken language

Gaofei Shen, Michaela Watkins, Afra Alishahi, Arianna Bisazza, Grzegorz Chrupała

Comments: Accepted to NAACL 2024

Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[191] arXiv:2403.17327 (cross-list from cs.SD) [pdf, html, other]: Title: Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer

Jeong-Yoon Kim, Seung-Ho Lee

Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[192] arXiv:2403.17376 (cross-list from cs.SD) [pdf, other]: Title: Theoretical Analysis of Quality of Conventional Beamforming for Phased Microphone Arrays

Dheepak Khatri, Kenneth Granlund

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[193] arXiv:2403.17378 (cross-list from cs.SD) [pdf, html, other]: Title: Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

Yang Ai, Zhen-Hua Ling

Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing. arXiv admin note: substantial text overlap with arXiv:2211.15974

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[194] arXiv:2403.17379 (cross-list from cs.SD) [pdf, html, other]: Title: Exploring and Applying Audio-Based Sentiment Analysis in Music

Etash Jhanji

Comments: 5 pages, 7 figures, 2 tables. For source code, see this https URL

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[195] arXiv:2403.17420 (cross-list from cs.CV) [pdf, html, other]: Title: Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

Comments: Accepted at CVPR 2024

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[196] arXiv:2403.17508 (cross-list from cs.SD) [pdf, html, other]: Title: Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant

Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Keisuke Imoto, Yuki Okamoto

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[197] arXiv:2403.17529 (cross-list from cs.SD) [pdf, html, other]: Title: Detection of Deepfake Environmental Audio

Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[198] arXiv:2403.17562 (cross-list from cs.SD) [pdf, html, other]: Title: Deep functional multiple index models with an application to SER

Matthieu Saumard, Abir El Haj, Thibault Napoleon

Comments: 5 pages, 1 figure

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Applications (stat.AP)
[199] arXiv:2403.18572 (cross-list from cs.SD) [pdf, html, other]: Title: ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds

Gijs Wijngaard, Elia Formisano, Bruno L. Giordano, Michel Dumontier

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[200] arXiv:2403.18635 (cross-list from cs.LG) [pdf, html, other]: Title: Fusion approaches for emotion recognition from speech using acoustic and text-based features

Leonardo Pepino, Pablo Riera, Luciana Ferrer, Agustin Gravano

Comments: 5 pages. Accepted in ICASSP 2020

Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
