Showing 1–15 of 15 results for author: Wang, Y F

Searching in archive eess.

  1. arXiv:2507.02768  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, et al. (3 additional authors not shown)

    Abstract: We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

  2. arXiv:2505.21198  [pdf, ps, other]

    cs.SD eess.AS

    Universal Speech Enhancement with Regression and Generative Mamba

    Authors: Rong Chao, Rauf Nasretdinov, Yu-Chiang Frank Wang, Ante Jukić, Szu-Wei Fu, Yu Tsao

    Abstract: The Interspeech 2025 URGENT Challenge aimed to advance universal, robust, and generalizable speech enhancement by unifying speech enhancement tasks across a wide variety of conditions, including seven different distortion types and five languages. We present Universal Speech Enhancement Mamba (USEMamba), a state-space speech enhancement model designed to handle long-range sequence modeling, time-f…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  3. arXiv:2505.09615  [pdf, other]

    cs.CV cs.SD eess.AS

    UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing

    Authors: Yung-Hsuan Lai, Janek Ebbers, Yu-Chiang Frank Wang, François Germain, Michael Jeffrey Jones, Moitreya Chatterjee

    Abstract: Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: CVPR 2025

  4. arXiv:2503.00733  [pdf, other]

    eess.AS cs.CL cs.SD

    UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

    Authors: Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro

    Abstract: Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are either designed for discriminative tasks or generative tasks. In this work, we make the first attempt at building a unified pre-training framework f…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: ICLR 2025; demo page at https://alexander-h-liu.github.io/uniwav-demo.github.io/

  5. arXiv:2501.03805  [pdf, other]

    cs.SD cs.CL eess.AS

    Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits

    Authors: Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Chao-Han Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee, Szu-Wei Fu

    Abstract: Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoof…

    Submitted 7 January, 2025; originally announced January 2025.

    Comments: SLT 2024

  6. arXiv:2411.15922  [pdf, other]

    eess.IV cs.CV

    PromptHSI: Universal Hyperspectral Image Restoration with Vision-Language Modulated Frequency Adaptation

    Authors: Chia-Ming Lee, Ching-Heng Cheng, Yu-Fan Lin, Yi-Ching Cheng, Wo-Ting Liao, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu

    Abstract: Recent advances in All-in-One (AiO) RGB image restoration have demonstrated the effectiveness of prompt learning in handling multiple degradations within a single model. However, extending these approaches to hyperspectral image (HSI) restoration is challenging due to the domain gap between RGB and HSI features, information loss in visual prompts under severe composite degradations, and difficulti…

    Submitted 11 March, 2025; v1 submitted 24 November, 2024; originally announced November 2024.

    Comments: Project page: https://chingheng0808.github.io/prompthsiP/static.html

  7. arXiv:2411.05945  [pdf, other]

    cs.CL cs.AI cs.LG cs.MA eess.AS

    NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts

    Authors: Yen-Ting Lin, Chao-Han Huck Yang, Zhehuai Chen, Piotr Zelasko, Xuesong Yang, Zih-Ching Chen, Krishna C Puvvada, Szu-Wei Fu, Ke Hu, Jun Wei Chiu, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang

    Abstract: Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in pa…

    Submitted 8 November, 2024; originally announced November 2024.

    Comments: NeKo work was done in June 2024. NeKo LMs will be open-sourced at https://huggingface.co/nvidia under the MIT license

  8. arXiv:2409.20007  [pdf, other]

    eess.AS cs.CL cs.SD

    DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

    Abstract: Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities…

    Submitted 27 January, 2025; v1 submitted 30 September, 2024; originally announced September 2024.

    Comments: Accepted by ICASSP 2025

  9. arXiv:2406.18871  [pdf, other]

    eess.AS cs.CL

    DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

    Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby fa…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  10. arXiv:2402.16321  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

    Authors: Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang

    Abstract: Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variatio…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Published as a conference paper at ICLR 2024

  11. arXiv:2305.17343  [pdf, other]

    cs.CV cs.SD eess.AS

    Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

    Authors: Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang

    Abstract: Audio-visual learning has been a major pillar of multi-modal machine learning, where the community mostly focused on its modality-aligned setting, i.e., the audio and visual modality are both assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video…

    Submitted 2 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  12. arXiv:2105.00708  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

    Authors: Yan-Bo Lin, Yu-Chiang Frank Wang

    Abstract: Humans perceive a rich auditory experience from the distinct sounds heard by their ears. Videos recorded with binaural audio in particular simulate how humans receive ambient sound. However, a large number of videos have monaural audio only, which degrades the user experience due to the lack of ambient information. To address this issue, we propose an audio spatialization framework to convert a monaural…

    Submitted 3 May, 2021; originally announced May 2021.

    Comments: AAAI'21

  13. arXiv:2007.09163  [pdf, other]

    cs.CV cs.LG cs.NE eess.IV

    Wavelet Channel Attention Module with a Fusion Network for Single Image Deraining

    Authors: Hao-Hsiang Yang, Chao-Han Huck Yang, Yu-Chiang Frank Wang

    Abstract: Single image deraining is a crucial problem because rain severely degrades the visibility of images and affects the performance of computer vision tasks like outdoor surveillance systems and intelligent vehicles. In this paper, we propose a new convolutional neural network (CNN) called the wavelet channel attention module with a fusion network. Wavelet transform and the inverse wavelet transf…

    Submitted 17 July, 2020; originally announced July 2020.

    Comments: Accepted to IEEE ICIP 2020

    Journal ref: 2020 IEEE International Conference on Image Processing (ICIP)

  14. arXiv:1806.09250  [pdf]

    physics.ins-det eess.SP

    Electronics of Time-of-flight Measurement for Back-n at CSNS

    Authors: T. Yu, P. Cao, X. Y. Ji, L. K. Xie, X. R. Huang, Q. An, H. Y. Bai, J. Bao, Y. H. Chen, P. J. Cheng, Z. Q. Cui, R. R. Fan, C. Q. Feng, M. H. Gu, Z. J. Han, G. Z. He, Y. C. He, Y. F. He, H. X. Huang, W. L. Huang, X. L. Ji, H. Y. Jiang, W. Jiang, H. Y. Jing, L. Kang, et al. (46 additional authors not shown)

    Abstract: Back-n is a white neutron experimental facility at the China Spallation Neutron Source (CSNS). The time structure of the primary proton beam makes it well suited to using the time-of-flight (TOF) method for neutron energy measurement. We implement the electronics of TOF measurement on the general-purpose readout electronics designed for all of the seven detectors in Back-n. The electronics is based on PXI…

    Submitted 24 June, 2018; originally announced June 2018.

    Comments: 4 pages, 13 figures, 21st IEEE Real Time Conference

  15. arXiv:1806.09249  [pdf]

    physics.ins-det eess.SP

    T0 Fan-out for Back-n White Neutron Facility at CSNS

    Authors: X. Y. Ji, P. Cao, T. Yu, L. K. Xie, X. R. Huang, Q. An, H. Y. Bai, J. Bao, Y. H. Chen, P. J. Cheng, Z. Q. Cui, R. R. Fan, C. Q. Feng, M. H. Gu, Z. J. Han, G. Z. He, Y. C. He, Y. F. He, H. X. Huang, W. L. Huang, X. L. Ji, H. Y. Jiang, W. Jiang, H. Y. Jing, L. Kang, et al. (46 additional authors not shown)

    Abstract: The main physics goal of the Back-n white neutron facility at the China Spallation Neutron Source (CSNS) is to measure nuclear data. The energy of neutrons is one of the most important parameters for measuring nuclear data. The time-of-flight (TOF) method is used to obtain the energy of neutrons. The time when proton bunches hit the thick tungsten target is taken as the start point of the TOF measurement. T0 signal,…

    Submitted 24 June, 2018; originally announced June 2018.

    Comments: 3 pages, 6 figures, the 21st IEEE Real Time Conference