Showing 1–50 of 97 results for author: Jung, J

Searching in archive eess.
  1. arXiv:2507.07643  [pdf, ps, other]

    eess.SP

    RIS-assisted ISAC Systems for Industrial Revolution 6.0: Exploring the Near-field and Far-field Coexistence

    Authors: Seonghoon Yoo, Jaemin Jung, Seongah Jeong, Jinkyu Kang, Markku Juntti, Joonhyuk Kang

    Abstract: The Industrial Internet of Things (IIoT) has emerged as a key technology for realizing the vision of Industry 6.0, requiring the seamless integration of diverse connected devices. In particular, integrated sensing and communication (ISAC) plays a critical role in supporting real-time control and automation within IIoT systems. In this paper, we explore reconfigurable intelligent surface (RIS)-assi…

    Submitted 10 July, 2025; originally announced July 2025.

  2. arXiv:2506.23552  [pdf, ps, other]

    cs.CV cs.SD eess.AS

    JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

    Authors: Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

    Abstract: The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transfo…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: project page: https://joonghyuk.com/jamflow-web Under review. Preprint published on arXiv

  3. arXiv:2506.13595  [pdf, ps, other]

    cs.SD cs.CG eess.AS

    Persistent Homology of Music Network with Three Different Distances

    Authors: Eunwoo Heo, Byeongchan Choi, Myung ock Kim, Mai Lan Tran, Jae-Hun Jung

    Abstract: Persistent homology has been widely used to discover hidden topological structures in data across various applications, including music data. To apply persistent homology, a distance or metric must be defined between points in a point cloud or between nodes in a graph network. These definitions are not unique and depend on the specific objectives of a given problem. In other words, selecting diffe…

    Submitted 16 June, 2025; originally announced June 2025.

  4. arXiv:2506.00722  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

    Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

    Abstract: Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-th…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted at INTERSPEECH 2025

  5. arXiv:2505.18860  [pdf, ps, other]

    eess.AS

    Context-Driven Dynamic Pruning for Large Speech Foundation Models

    Authors: Masao Someki, Shikhar Bharadwaj, Atharva Anand Joshi, Chyi-Jiunn Lin, Jinchuan Tian, Jee-weon Jung, Markus Müller, Nathan Susanj, Jing Liu, Shinji Watanabe

    Abstract: Speech foundation models achieve strong generalization across languages and acoustic conditions, but require significant computational resources for inference. In the context of speech foundation models, pruning techniques have been studied that dynamically optimize model structures based on the target audio leveraging external context. In this work, we extend this line of research and propose con…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  6. arXiv:2505.16798  [pdf, ps, other]

    eess.AS cs.AI

    SEED: Speaker Embedding Enhancement Diffusion Model

    Authors: KiHyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung

    Abstract: A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker e…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025. The official code can be found at https://github.com/kaistmm/seed-pytorch

  7. arXiv:2505.12863  [pdf, other]

    cs.SD cs.AI cs.CV eess.AS

    Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

    Authors: Jongmin Jung, Dongmin Kim, Sihun Lee, Seola Cho, Hyungjoon Soh, Irmak Bukey, Chris Donahue, Dasaem Jeong

    Abstract: Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between each modality are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual tran…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Submitted to IEEE Transactions on Audio, Speech and Language Processing (TASLPRO)

  8. arXiv:2505.10101  [pdf, other]

    cs.SD cs.AI cs.GR cs.MM eess.AS

    LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2

    Authors: Jongmin Jung, Dasaem Jeong

    Abstract: This paper introduces LAV (Latent Audio-Visual), a system that integrates EnCodec's neural audio compression with StyleGAN2's generative capabilities to produce visually dynamic outputs driven by pre-recorded audio. Unlike previous works that rely on explicit feature mappings, LAV uses EnCodec embeddings as latent representations, directly transformed into StyleGAN2's style latent space via random…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: Paper accepted at ISEA 2025, The 30th International Symposium on Electronic/Emerging Art, Seoul, Republic of Korea, 23 - 29 May 2025

  9. arXiv:2505.07364  [pdf, other]

    eess.IV cs.AI

    GAN-based synthetic FDG PET images from T1 brain MRI can serve to improve performance of deep unsupervised anomaly detection models

    Authors: Daria Zotova, Nicolas Pinon, Robin Trombetta, Romain Bouet, Julien Jung, Carole Lartizien

    Abstract: Background and Objective. Research in the cross-modal medical image translation domain has been very productive over the past few years in tackling the scarce availability of large curated multimodality datasets with the promising performance of GAN-based architectures. However, only a few of these studies assessed task-based related performance of these synthetic data, especially for the training…

    Submitted 12 May, 2025; originally announced May 2025.

    Journal ref: Computer Methods and Programs in Biomedicine, 2025, 265, pp.108727

  10. arXiv:2503.05361  [pdf, other]

    eess.SY

    Community Energy Management System for Fast Frequency Response: A Hierarchical Control Approach

    Authors: Joonsung Jung, Hyunjoong Kim, Hyunghwan Shin, Jip Kim

    Abstract: The increase in renewable energy sources (RES) has reduced power system inertia, making frequency stabilization more challenging and highlighting the need for fast frequency response (FFR) resources. While building energy management systems (BEMS) equipped with distributed energy resources (DERs) can provide FFR, individual BEMS alone cannot fully meet demand. To address this, we propose a communi…

    Submitted 7 March, 2025; originally announced March 2025.

    Comments: 5 pages, 7 figures, submitted to PES General Meeting 2025

    MSC Class: 90C05; 90C90 ACM Class: I.2.8; C.3; G.1.6

  11. arXiv:2502.08857  [pdf, other]

    eess.AS

    ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

    Authors: Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi, Myeonghun Jeong, Ge Zhu, Yongyi Zang, You Zhang, Soumi Maiti, Florian Lux, Nicolas Müller, Wangyou Zhang, Chengzhe Sun, Shuwei Hou, Siwei Lyu, Sébastien Le Maguer, et al. (4 additional authors not shown)

    Abstract: ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks as well as the design of detection solutions. We introduce the ASVspoof 5 database which is generated in a crowdsourced fashion from data collected in diverse acoustic conditions (cf. studio-quality data for earlier ASVspoof databases) and from ~2,000 speakers (cf. ~100 earlier…

    Submitted 24 April, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

    Comments: Database link: https://zenodo.org/records/14498691, Database mirror link: https://huggingface.co/datasets/jungjee/asvspoof5, ASVspoof 5 Challenge Workshop Proceeding: https://www.isca-archive.org/asvspoof_2024/index.html

  12. arXiv:2412.19259  [pdf, other]

    eess.AS cs.SD

    VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

    Authors: Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, Joon Son Chung

    Abstract: We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline…

    Submitted 26 December, 2024; originally announced December 2024.

    Comments: Accepted to ICASSP 2025

  13. arXiv:2412.00325  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    MusicGen-Chord: Advancing Music Generation through Chord Progressions and Interactive Web-UI

    Authors: Jongmin Jung, Andreas Jansson, Dasaem Jeong

    Abstract: MusicGen is a music generation language model (LM) that can be conditioned on textual descriptions and melodic features. We introduce MusicGen-Chord, which extends this capability by incorporating chord progression features. This model modifies one-hot encoded melody chroma vectors into multi-hot encoded chord chroma vectors, enabling the generation of music that reflects both chord progressions a…

    Submitted 29 November, 2024; originally announced December 2024.

    Comments: Late-breaking/demo (LBD) at ISMIR 2024. https://ismir2024program.ismir.net/lbd_424.html

  14. arXiv:2409.17285  [pdf, other]

    cs.SD cs.AI eess.AS

    SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

    Authors: Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe

    Abstract: This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with diffe…

    Submitted 15 April, 2025; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: IEEE OJSP. Official document lives at: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10839331

  15. arXiv:2409.15897  [pdf, ps, other]

    eess.AS cs.SD

    ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

    Authors: Jiatong Shi, Jinchuan Tian, Yihan Wu, Jee-weon Jung, Jia Qi Yip, Yoshiki Masuyama, William Chen, Yuning Wu, Yuxun Tang, Massa Baali, Dareen Alharhi, Dong Zhang, Ruifan Deng, Tejes Srivastava, Haibin Wu, Alexander H. Liu, Bhiksha Raj, Qin Jin, Ruihua Song, Shinji Watanabe

    Abstract: Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse appli…

    Submitted 24 February, 2025; v1 submitted 24 September, 2024; originally announced September 2024.

    Comments: Accepted by SLT

  16. arXiv:2409.10791  [pdf, other]

    eess.AS cs.SD

    Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

    Authors: Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald

    Abstract: Iterative self-training, or iterative pseudo-labeling (IPL) -- using an improved model from the current iteration to provide pseudo-labels for the next iteration -- has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.…

    Submitted 17 January, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: ICASSP 2025

  17. arXiv:2409.08711  [pdf, ps, other]

    eess.AS cs.AI

    Text-To-Speech Synthesis In The Wild

    Authors: Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe

    Abstract: Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as noisy-TTS training has emerged, aiming to utilize in-the-wild data. However, the lack of dedicated datasets has been a significant limitation. We introduce the TTS In the Wild (TITW) dataset, which is publicly available, created through a fully automated pipeline ap…

    Submitted 1 June, 2025; v1 submitted 13 September, 2024; originally announced September 2024.

    Comments: 5 pages, Interspeech 2025

  18. arXiv:2409.01201  [pdf, other]

    eess.AS cs.AI cs.SD

    EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance

    Authors: Jaeyeon Kim, Minjeon Jeon, Jaeyoon Jung, Sang Hoon Woo, Jinjoo Lee

    Abstract: In this work, we aim to analyze and optimize the EnCLAP framework, a state-of-the-art model in automated audio captioning. We investigate the impact of modifying the acoustic encoder components, explore pretraining with different dataset scales, and study the effectiveness of a reranking scheme. Through extensive experimentation and quantitative analysis of generated captions, we develop EnCLAP++,…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Accepted to DCASE2024 Workshop

  19. arXiv:2409.01160  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning

    Authors: Jaeyeon Kim, Jaeyoon Jung, Minjeong Jeon, Sang Hoon Woo, Jinjoo Lee

    Abstract: In this technical report, we describe our submission to DCASE2024 Challenge Task6 (Automated Audio Captioning) and Task8 (Language-based Audio Retrieval). We develop our approach building upon the EnCLAP audio captioning framework and optimizing it for Task6 of the challenge. Notably, we outline the changes in the underlying components and the incorporation of the reranking process. Additionally,…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: DCASE2024 Challenge Technical Report. Ranked 2nd in Task 6 Automated Audio Captioning

  20. arXiv:2408.14886  [pdf, other]

    cs.SD cs.AI eess.AS

    The VoxCeleb Speaker Recognition Challenge: A Retrospective

    Authors: Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman

    Abstract: The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provide…

    Submitted 27 August, 2024; originally announced August 2024.

    Comments: TASLP 2024

  21. arXiv:2408.08739  [pdf, other]

    eess.AS cs.AI cs.SD

    ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

    Authors: Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi

    Abstract: ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogat…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)

  22. arXiv:2408.03593  [pdf, other]

    eess.AS

    Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting

    Authors: Youkyum Kim, Jaemin Jung, Jihwan Park, Byeong-Yeol Kim, Joon Son Chung

    Abstract: This paper proposes a novel user-defined keyword spotting framework that accurately detects audio keywords based on text enrollment. Since audio data possesses additional acoustic information compared to text, there are discrepancies between these two modalities. To address this challenge, we present ParallelKWS, which utilises self- and cross-attention in a parallel architecture to effectively ca…

    Submitted 7 August, 2024; originally announced August 2024.

    Comments: This work has been submitted to the IEEE for possible publication

  23. arXiv:2408.01180  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation

    Authors: Jiwoo Ryu, Hao-Wen Dong, Jongmin Jung, Dasaem Jeong

    Abstract: Representing symbolic music with compound tokens, where each token consists of several different sub-tokens representing a distinct musical feature or attribute, offers the advantage of reducing sequence length. While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results as it may not full…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: Accepted at 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

  24. arXiv:2406.17246  [pdf, other]

    cs.SD cs.AI eess.AS

    Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing

    Authors: Hye-jin Shim, Md Sahidullah, Jee-weon Jung, Shinji Watanabe, Tomi Kinnunen

    Abstract: Current trends in audio anti-spoofing detection research strive to improve models' ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend c…

    Submitted 26 August, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure, 5 tables, ISCA Interspeech 2024 SynData4GenAI Workshop

  25. arXiv:2406.14559  [pdf, other]

    cs.SD eess.AS

    Disentangled Representation Learning for Environment-agnostic Speaker Recognition

    Authors: KiHyun Nam, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung

    Abstract: This work presents a framework based on feature disentanglement to learn speaker embeddings that are robust to environmental variations. Our framework utilises an auto-encoder as a disentangler, dividing the input speaker embedding into components related to the speaker and other residual information. We employ a group of objective functions to ensure that the auto-encoder's code representation -…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024. The official webpage can be found at https://mm.kaist.ac.kr/projects/voxceleb-disentangler/

  26. arXiv:2406.10083  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Evaluation of Speech Foundation Models for Spoken Language Understanding

    Authors: Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe

    Abstract: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for th…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted at ACL Findings 2024

  27. arXiv:2406.05339  [pdf, other]

    eess.AS cs.AI

    To what extent can ASV systems naturally defend against spoofing attacks?

    Authors: Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, Joon Son Chung

    Abstract: The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically ex…

    Submitted 17 November, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, 3 tables, Interspeech 2024

  28. Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

    Authors: Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian

    Abstract: Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement.…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, 4 tables, Accepted by Interspeech 2024

  29. arXiv:2403.01355  [pdf, ps, other]

    eess.AS cs.LG

    a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

    Authors: Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen, Nicholas Evans, Jean-Francois Bonastre, Itshak Lapidot

    Abstract: Spoofing detection is today a mainstream research topic. Standard metrics can be applied to evaluate the performance of isolated spoofing detection solutions and others have been proposed to support their evaluation when they are combined with speaker detection. These either have well-known deficiencies or restrict the architectural approach to combine speaker and spoof detectors. In this paper, w…

    Submitted 15 April, 2025; v1 submitted 2 March, 2024; originally announced March 2024.

    Comments: published at ISCA Speaker Odyssey 2024

  30. arXiv:2402.16021  [pdf, ps, other]

    cs.CL cs.AI cs.CV eess.AS

    TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

    Authors: Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

    Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, whe…

    Submitted 5 June, 2025; v1 submitted 25 February, 2024; originally announced February 2024.

    Comments: IEEE TMM

  31. arXiv:2402.10748  [pdf, other]

    eess.SP cs.HC cs.LG

    A Tiny Transformer for Low-Power Arrhythmia Classification on Microcontrollers

    Authors: Paola Busia, Matteo Antonio Scrugli, Victor Jean-Baptiste Jung, Luca Benini, Paolo Meloni

    Abstract: Wearable systems for the continuous and real-time monitoring of cardiovascular diseases are becoming widespread and valuable assets in diagnosis and therapy. A promising approach for real-time analysis of the electrocardiographic (ECG) signal and the detection of heart conditions, such as arrhythmia, is represented by the transformer machine learning model. Transformers are powerful models for the…

    Submitted 21 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: 2024 IEEE Transactions on Biomedical Circuits and Systems

  32. arXiv:2402.00340  [pdf, other]

    cs.SD eess.AS

    Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?

    Authors: Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Barry-John Theobald

    Abstract: Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised spe…

    Submitted 13 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  33. arXiv:2401.17690  [pdf, other]

    eess.AS cs.AI cs.SD

    EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

    Authors: Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, Sang Hoon Woo

    Abstract: We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpa…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  34. arXiv:2401.17230  [pdf, other]

    cs.SD cs.AI eess.AS

    ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

    Authors: Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

    Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also…

    Submitted 13 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, 7 tables, Interspeech 2024

  35. arXiv:2401.16658  [pdf, ps, other]

    cs.CL eess.AS

    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

    Authors: Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

    Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder archite…

    Submitted 26 August, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/

  36. arXiv:2401.14271  [pdf, other]

    eess.AS cs.SD

    Improving Design of Input Condition Invariant Speech Enhancement

    Authors: Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

    Abstract: Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model wa…

    Submitted 15 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024, 5 pages, 2 figures, 3 tables (corrected the results of no processing on CHiME-4 (Simu) in Table 2)

  37. arXiv:2312.10019  [pdf, other]

    cs.IT cs.LG eess.AS

    Understanding Probe Behaviors through Variational Bounds of Mutual Information

    Authors: Kwanghee Choi, Jee-weon Jung, Shinji Watanabe

    Abstract: With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. Fi…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024, implementation available at https://github.com/juice500ml/information_probing

  38. Deep Video Inpainting Guided by Audio-Visual Self-Supervision

    Authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon

    Abstract: Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting. To implement the prior knowledge, we first train the audio-visual network, which learns the correspondence between auditory and visual information. Then, the audio-vis…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: Accepted at ICASSP 2022

  39. arXiv:2310.02973  [pdf, other]

    cs.CL cs.SD eess.AS

    UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

    Authors: Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

    Abstract: Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additio…

    Submitted 3 April, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted at NAACL 2024

  40. arXiv:2310.01688  [pdf, other]

    eess.AS cs.CL cs.SD

    One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

    Authors: Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini

    Abstract: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary length inputs and can handle any number of speakers, effectively solving "who spoke what, when" concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarizat…

    Submitted 2 October, 2023; originally announced October 2023.

  41. arXiv:2309.17352  [pdf, other]

    cs.SD eess.AS

    Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

    Authors: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe

    Abstract: Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this w…

    Submitted 9 January, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICASSP 2024 camera-ready paper. Winner of the DCASE 2023 Challenge Task 6A: Automated Audio Captioning (AAC)

  42. arXiv:2309.15800  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  43. arXiv:2309.13876  [pdf, other]

    cs.CL cs.SD eess.AS

    Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

    Authors: Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

    Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessib…

    Submitted 24 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

  44. arXiv:2309.07937  [pdf, other]

    eess.AS cs.LG cs.SD

    Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

    Authors: Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe

    Abstract: We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech syn…

    Submitted 24 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

  45. arXiv:2308.01499  [pdf, other]

    cs.CV eess.IV

    TDMD: A Database for Dynamic Color Mesh Subjective and Objective Quality Explorations

    Authors: Qi Yang, Joel Jung, Timon Deschamps, Xiaozhong Xu, Shan Liu

    Abstract: Dynamic colored meshes (DCM) are widely used in various applications; however, these meshes may undergo different processes, such as compression or transmission, which can distort them and degrade their quality. To facilitate the development of objective metrics for DCMs and study the influence of typical distortions on their perception, we create the Tencent - dynamic colored mesh database (TDMD)…

    Submitted 2 August, 2023; originally announced August 2023.

  46. arXiv:2306.00680  [pdf, other]

    cs.SD cs.AI eess.AS

    Encoder-decoder multimodal speaker change detection

    Authors: Jee-weon Jung, Soonshin Seo, Hee-Soo Heo, Geonmin Kim, You Jin Kim, Young-ki Kwon, Minjae Lee, Bong-Jin Lee

    Abstract: The task of speaker change detection (SCD), which detects points where speakers change in an input, is essential for several applications. Several studies solved the SCD task using audio inputs only and have shown limited performance. Recently, multimodal SCD (MMSCD) models, which utilise text modality in addition to audio, have shown improved performance. In this study, the proposed model are bui…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 5 pages, accepted for presentation at INTERSPEECH 2023

  47. arXiv:2305.19953  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

    Authors: Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen

    Abstract: Audio anti-spoofing for automatic speaker verification aims to safeguard users' identities from spoofing attacks. Although state-of-the-art spoofing countermeasure (CM) models perform well on specific datasets, they lack generalization when evaluated with different datasets. To address this limitation, previous studies have explored large pre-trained models, which require significant resources and…

    Submitted 1 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  48. arXiv:2305.19051  [pdf, other]

    eess.AS cs.AI cs.SD

    Towards single integrated spoofing-aware speaker verification embeddings

    Authors: Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung

    Abstract: This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outpe…

    Submitted 1 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023. Code and models are available at https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline

  49. arXiv:2302.10248  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

    Authors: Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

    Abstract: This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker re…

    Submitted 6 March, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

  50. arXiv:2301.06567  [pdf, other]

    cs.CV eess.IV

    Scalable Surface Water Mapping up to Fine-scale using Geometric Features of Water from Topographic Airborne LiDAR Data

    Authors: Hunsoo Song, Jinha Jung

    Abstract: Despite substantial technological advancements, the comprehensive mapping of surface water, particularly smaller bodies (<1ha), continues to be a challenge due to a lack of robust, scalable methods. Standard methods require either training labels or site-specific parameter tuning, which complicates automated mapping and introduces biases related to training data and parameters. The reliance on wat…

    Submitted 15 August, 2023; v1 submitted 16 January, 2023; originally announced January 2023.