Skip to main content

Showing 1–14 of 14 results for author: Kuan, C

Searching in archive eess. Search in all archives.
.
  1. arXiv:2507.02768  [pdf, ps, other

    eess.AS cs.CL cs.SD

    DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang , et al. (3 additional authors not shown)

    Abstract: We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these… ▽ More

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

  2. arXiv:2505.20166  [pdf, ps, other

    eess.AS cs.AI cs.CL cs.LG cs.SD

    From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

    Authors: Chun-Yi Kuan, Hung-yi Lee

    Abstract: Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capa… ▽ More

    Submitted 30 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. Project Website: https://kuan2jiu99.github.io/Balsa

  3. arXiv:2505.19037  [pdf, ps, other

    eess.AS cs.CL

    Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models

    Authors: Ke-Han Lu, Chun-Yi Kuan, Hung-yi Lee

    Abstract: We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities and quantify catastrophic forgetting in speech-aware language models (SLMs). Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Existing benchmarks conflate speech perception with instruction-following… ▽ More

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: Accecpted by Interspeech 2025; https://github.com/kehanlu/Speech-IFEval

  4. arXiv:2505.14518  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples

    Authors: Chun-Yi Kuan, Hung-yi Lee

    Abstract: Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs' abili… ▽ More

    Submitted 30 June, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025. Project Website: https://kuan2jiu99.github.io/Balsa

  5. arXiv:2502.05649  [pdf, other

    cs.CL cs.LG eess.AS

    Gender Bias in Instruction-Guided Speech Synthesis Models

    Authors: Chun-Yi Kuan, Hung-yi Lee

    Abstract: Recent advancements in controllable expressive speech synthesis, especially in text-to-speech (TTS) models, have allowed for the generation of speech with specific styles guided by textual descriptions, known as style prompts. While this development enhances the flexibility and naturalness of synthesized speech, there remains a significant gap in understanding how these models handle vague or abst… ▽ More

    Submitted 8 February, 2025; originally announced February 2025.

    Comments: NAACL 2025 Findings

  6. arXiv:2411.07111  [pdf, other

    cs.CL cs.SD eess.AS

    Building a Taiwanese Mandarin Spoken Language Model: A First Attempt

    Authors: Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, Yi-Cheng Lin, Yu-Xiang Lin, Wei-Chih Chen, Ho Lam Chung, Chun-Yi Kuan, Wei-Ping Huang, Ke-Han Lu, Tzu-Quan Lin, Hsiu-Hsuan Wang, En-Pei Hu, Chan-Jan Hsu, Liang-Hsuan Tseng, I-Hsiang Chiu, Ulin Sanga, Xuanjun Chen, Po-chun Hsu, Shu-wen Yang, Hung-yi Lee

    Abstract: This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex… ▽ More

    Submitted 27 December, 2024; v1 submitted 11 November, 2024; originally announced November 2024.

    Comments: Work in progress

  7. arXiv:2411.05361  [pdf, ps, other

    cs.CL eess.AS

    Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

    Authors: Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang , et al. (55 additional authors not shown)

    Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati… ▽ More

    Submitted 9 June, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: ICLR 2025

  8. arXiv:2410.16130  [pdf, other

    eess.AS cs.CL cs.SD

    Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

    Authors: Chun-Yi Kuan, Hung-yi Lee

    Abstract: Recent advancements in large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources, which undermine their reliability and real-world applicati… ▽ More

    Submitted 31 December, 2024; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: Accepted to ICASSP 2025. Project Website: https://github.com/kuan2jiu99/audio-hallucination

  9. arXiv:2407.09886  [pdf, other

    eess.AS cs.CL cs.SD

    Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

    Authors: Chun-Yi Kuan, Chih-Kai Yang, Wei-Ping Huang, Ke-Han Lu, Hung-yi Lee

    Abstract: In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible ag… ▽ More

    Submitted 23 September, 2024; v1 submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted to SLT 2024

  10. Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models

    Authors: Yi-Cheng Lin, Tzu-Quan Lin, Chih-Kai Yang, Ke-Han Lu, Wei-Chih Chen, Chun-Yi Kuan, Hung-yi Lee

    Abstract: Speech Integrated Large Language Models (SILLMs) combine large language models with speech perception to perform diverse tasks, such as emotion recognition to speaker verification, demonstrating universal audio understanding capability. However, these models may amplify biases present in training data, potentially leading to biased access to information for marginalized groups. This work introduce… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Journal ref: 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, 2024, pp. 439-446

  11. arXiv:2406.08402  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models

    Authors: Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee

    Abstract: Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks. Previous research has primarily focused on assessing the performance of LALMs across various tasks, yet overlooking their reliability, particularly concerning issues like object hallucination. In our study, we introduce methods to… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  12. arXiv:2401.00273  [pdf, ps, other

    eess.AS cs.CL

    Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision

    Authors: Chih-Kai Yang, Kuan-Po Huang, Ke-Han Lu, Chun-Yi Kuan, Chi-Yuan Hsiao, Hung-yi Lee

    Abstract: This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora. We found that self-supervised models can achieve performances close to the supervised model, indicating the effectiveness of multilingual self-supervised pre-training. We also observed that… ▽ More

    Submitted 30 December, 2023; originally announced January 2024.

    Comments: Submitted to ICASSP 2024 Self-supervision in Audio, Speech and Beyond workshop

  13. arXiv:2309.14324  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Towards General-Purpose Text-Instruction-Guided Voice Conversion

    Authors: Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-yiin Chang, Hung-yi Lee

    Abstract: This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice". Unlike traditional methods that rely on reference utterances to determine the attributes of the converted speech, our model adds versatility and specificity to voice conversion. The proposed VC model is a neural codec language mo… ▽ More

    Submitted 16 January, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted to ASRU 2023

  14. arXiv:2309.09510  [pdf, ps, other

    eess.AS cs.LG cs.SD

    Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

    Authors: Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee

    Abstract: Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for bui… ▽ More

    Submitted 22 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: To appear in the proceedings of ICASSP 2024