Showing 1–10 of 10 results for author: Hai, J

Searching in archive cs.
  1. arXiv:2506.02863

    eess.AS cs.AI cs.SD

    CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

    Authors: Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Laureano Moro Velazquez, Jesus Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak

    Abstract: Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchm…

    Submitted 3 June, 2025; originally announced June 2025.

  2. arXiv:2506.01257

    cs.CL cs.AI

    DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models

    Authors: Jiancheng Ye, Sophie Bronstein, Jiarui Hai, Malak Abu Hashish

    Abstract: DeepSeek-R1 is a cutting-edge open-source large language model (LLM) developed by DeepSeek, showcasing advanced reasoning capabilities through a hybrid architecture that integrates mixture of experts (MoE), chain of thought (CoT) reasoning, and reinforcement learning. Released under the permissive MIT license, DeepSeek-R1 offers a transparent and cost-effective alternative to proprietary models li…

    Submitted 1 June, 2025; originally announced June 2025.

  3. arXiv:2505.19314

    eess.AS cs.AI cs.SD

    SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

    Authors: Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak

    Abstract: Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are se…

    Submitted 31 May, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

  4. arXiv:2409.10819

    eess.AS cs.SD

    EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

    Authors: Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

    Abstract: We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling techni…

    Submitted 19 June, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: Accepted at Interspeech 2025

  5. arXiv:2409.08425

    eess.AS cs.SD

    SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

    Authors: Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak

    Abstract: In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by utilizing a CLAP model as the feature extractor for targe…

    Submitted 1 January, 2025; v1 submitted 12 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  6. arXiv:2409.07556

    eess.AS cs.SD

    SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

    Authors: Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu

    Abstract: In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited re…

    Submitted 1 January, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

    Comments: ICASSP 2025

  7. arXiv:2402.06599

    cs.CV cs.AI

    On the Out-Of-Distribution Generalization of Multimodal Large Language Models

    Authors: Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazheng Xu, Peng Cui

    Abstract: We investigate the generalization boundaries of current Multimodal Large Language Models (MLLMs) via comprehensive evaluation under out-of-distribution scenarios and domain-specific tasks. We evaluate their zero-shot generalization across synthetic images, real-world distributional shifts, and specialized datasets like medical and molecular imagery. Empirical results indicate that MLLMs struggle w…

    Submitted 9 February, 2024; originally announced February 2024.

  8. arXiv:2311.00814

    cs.SD eess.AS

    Investigating Self-Supervised Deep Representations for EEG-based Auditory Attention Decoding

    Authors: Karan Thakkar, Jiarui Hai, Mounya Elhilali

    Abstract: Auditory Attention Decoding (AAD) algorithms play a crucial role in isolating desired sound sources within challenging acoustic environments directly from brain activity. Although recent research has shown promise in AAD using shallow representations such as auditory envelope and spectrogram, there has been limited exploration of deep Self-Supervised (SS) representations on a larger scale. In this…

    Submitted 7 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: Submitted to ICASSP 2024

  9. arXiv:2310.04567

    eess.AS cs.SD

    DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

    Authors: Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali

    Abstract: Common target sound extraction (TSE) approaches primarily relied on discriminative approaches in order to separate the target sound while minimizing interference from the unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, a first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, to achieve…

    Submitted 9 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  10. R2RNet: Low-light Image Enhancement via Real-low to Real-normal Network

    Authors: Jiang Hai, Zhu Xuan, Songchen Han, Ren Yang, Yutong Hao, Fengzhu Zou, Fang Lin

    Abstract: Images captured in weak illumination conditions could seriously degrade the image quality. Solving a series of degradation of low-light images can effectively improve the visual quality of images and the performance of high-level visual tasks. In this study, a novel Retinex-based Real-low to Real-normal Network (R2RNet) is proposed for low-light image enhancement, which includes three subnets: a D…

    Submitted 11 November, 2021; v1 submitted 28 June, 2021; originally announced June 2021.

    Comments: 12 pages, 9 figures

    Journal ref: Journal of Visual Communication and Image Representation, 2022