Showing 1–50 of 93 results for author: Bai, Y

Searching in archive eess.
  1. arXiv:2507.05177  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

    Authors: Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang

    Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for trans…

    Submitted 8 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: Technical Report

  2. arXiv:2507.00185  [pdf]

    eess.IV cs.AI cs.CV

    Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

    Authors: Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting

    Abstract: Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundatio…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: 42 pages, 3 composite figures, 4 tables

  3. arXiv:2506.20493  [pdf]

    eess.SY cs.GT

    Analyzing the Impact of Strategic Bidding on the Reserve Capacity via a Bi-Level Model

    Authors: Yun Xu, Yunxiao Bai, Yunyong Zhang, Peng Wang, Xuelin Wang, Jiqun Guo, Kaijun Xie, Rusheng Zhao

    Abstract: The growing integration of renewable energy sources necessitates adequate reserve capacity to maintain power balance. However, in market clearing, power companies with flexible resources may submit strategic bids to maximize profits, potentially compromising system reserves. This paper examines the effects of such strategic behavior by modeling the market as a bi-level problem. The upper level rep…

    Submitted 25 June, 2025; originally announced June 2025.

  4. arXiv:2506.11163  [pdf, ps, other]

    eess.IV cs.CV cs.GR

    Vector Representations of Vessel Trees

    Authors: James Batten, Michiel Schaap, Matthew Sinclair, Ying Bai, Ben Glocker

    Abstract: We introduce a novel framework for learning vector representations of tree-structured geometric data focusing on 3D vascular networks. Our approach employs two sequentially trained Transformer-based autoencoders. In the first stage, the Vessel Autoencoder captures continuous geometric details of individual vessel segments by learning embeddings from sampled points along each curve. In the second s…

    Submitted 11 June, 2025; originally announced June 2025.

  5. arXiv:2505.08414  [pdf]

    eess.IV cs.CV

    An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care

    Authors: Zhi Da Soh, Yang Bai, Kai Yu, Yang Zhou, Xiaofeng Lei, Sahil Thakur, Zann Lee, Lee Ching Linette Phang, Qingsheng Peng, Can Can Xue, Rachel Shujuan Chong, Quan V. Hoang, Lavanya Raghavan, Yih Chung Tham, Charumathi Sabanayagam, Wei-Chi Wu, Ming-Chih Ho, Jiangnan He, Preeti Gupta, Ecosse Lamoureux, Seang Mei Saw, Vinay Nangia, Songhomitra Panda-Jonas, Jie Xu, Ya Xing Wang, et al. (6 additional authors not shown)

    Abstract: Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptati…

    Submitted 13 May, 2025; originally announced May 2025.

  6. arXiv:2505.05795  [pdf, other]

    eess.SY cs.RO

    Formation Maneuver Control Based on the Augmented Laplacian Method

    Authors: Xinzhe Zhou, Xuyang Wang, Xiaoming Duan, Yuzhu Bai, Jianping He

    Abstract: This paper proposes a novel formation maneuver control method for both 2-D and 3-D space, which enables the formation to translate, scale, and rotate with arbitrary orientation. The core innovation is the novel design of weights in the proposed augmented Laplacian matrix. Instead of using scalars, we represent weights as matrices, which are designed based on a specified rotation axis and allow the…

    Submitted 9 May, 2025; originally announced May 2025.

  7. arXiv:2504.08907  [pdf, other]

    cs.SD cs.CL eess.AS

    Spatial Audio Processing with Large Language Model on Wearable Devices

    Authors: Ayushi Mishra, Yang Bai, Priyadarshan Narayanasamy, Nakul Garg, Nirupam Roy

    Abstract: Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spati…

    Submitted 25 April, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

  8. arXiv:2503.03355  [pdf, other]

    cs.CV cs.LG eess.IV

    Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment

    Authors: Zhihao Zhan, Wang Pang, Xiang Zhu, Yechao Bai

    Abstract: In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily…

    Submitted 8 May, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

  9. arXiv:2503.02030  [pdf, other]

    cs.LG eess.SY

    Accelerating Multi-Task Temporal Difference Learning under Low-Rank Representation

    Authors: Yitao Bai, Sihan Zeng, Justin Romberg, Thinh T. Doan

    Abstract: We study policy evaluation problems in multi-task reinforcement learning (RL) under a low-rank representation setting. In this setting, we are given $N$ learning tasks where the corresponding value function of these tasks lie in an $r$-dimensional subspace, with $r<N$. One can apply the classic temporal-difference (TD) learning method for solving these problems where this method learns the value f…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: 13 pages, 3 figures

  10. arXiv:2501.12604  [pdf, other]

    eess.IV cs.CV cs.LG

    Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models

    Authors: Wang Pang, Zhihao Zhan, Xiang Zhu, Yechao Bai

    Abstract: Most motion deblurring algorithms rely on spatial-domain convolution models, which struggle with the complex, non-linear blur arising from camera shake and object motion. In contrast, we propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon. Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture divers…

    Submitted 21 January, 2025; originally announced January 2025.

  11. arXiv:2412.17464  [pdf, other]

    cs.CV eess.IV

    CALLIC: Content Adaptive Learning for Lossless Image Compression

    Authors: Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu, Wen Gao

    Abstract: Learned lossless image compression has achieved significant advancements in recent years. However, existing methods often rely on training amortized generative models on massive datasets, resulting in sub-optimal probability distribution estimation for specific testing images during encoding process. To address this challenge, we explore the connection between the Minimum Description Length (MDL)…

    Submitted 23 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  12. arXiv:2412.02611  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM cs.SD eess.AS

    AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

    Authors: Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue

    Abstract: Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two s…

    Submitted 3 December, 2024; originally announced December 2024.

    Comments: Project page: https://av-odyssey.github.io/

  13. arXiv:2411.18290  [pdf, ps, other]

    eess.IV cs.CV

    Leveraging Semantic Asymmetry for Precise Gross Tumor Volume Segmentation of Nasopharyngeal Carcinoma in Planning CT

    Authors: Zi Li, Ying Chen, Zeli Chen, Yanzhou Su, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Yunhai Bai, Zhinlin Zheng, Le Lu, Yirui Wang, Jia Ge, Xianghua Ye, Senxiang Yan, Dakai Jin

    Abstract: In the radiation therapy of nasopharyngeal carcinoma (NPC), clinicians typically delineate the gross tumor volume (GTV) using non-contrast planning computed tomography to ensure accurate radiation dose delivery. However, the low contrast between tumors and adjacent normal tissues necessitates that radiation oncologists manually delineate the tumors, often relying on diagnostic MRI for guidance. …

    Submitted 26 June, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

  14. arXiv:2411.13970  [pdf, other]

    eess.SP cs.LG

    Movable Antenna-Equipped UAV for Data Collection in Backscatter Sensor Networks: A Deep Reinforcement Learning-based Approach

    Authors: Yu Bai, Boxuan Xie, Ruifan Zhu, Zheng Chang, Riku Jantti

    Abstract: Backscatter communication (BC) becomes a promising energy-efficient solution for future wireless sensor networks (WSNs). Unmanned aerial vehicles (UAVs) enable flexible data collection from remote backscatter devices (BDs), yet conventional UAVs rely on omni-directional fixed-position antennas (FPAs), limiting channel gain and prolonging data collection time. To address this issue, we consider equ…

    Submitted 21 November, 2024; originally announced November 2024.

  15. arXiv:2410.20324  [pdf, other]

    cs.CR eess.SP

    A New Non-Binary Response Generation Scheme from Physical Unclonable Functions

    Authors: Yonghong Bai, Zhiyuan Yan

    Abstract: Physical Unclonable Functions (PUFs) are widely used in key generation, with each PUF cell typically producing one bit of data. To enable the extraction of longer keys, a new non-binary response generation scheme based on the one-probability of PUF bits is proposed. Instead of using PUF bits directly as keys, non-binary responses are first derived by comparing the one-frequency of PUF bits with th…

    Submitted 26 October, 2024; originally announced October 2024.

    Comments: 5 pages, 2 figures, conference

  16. arXiv:2410.20309  [pdf, other]

    eess.IV cs.AI cs.CV

    Enhancing Community Vision Screening -- AI Driven Retinal Photography for Early Disease Detection and Patient Trust

    Authors: Xiaofeng Lei, Yih-Chung Tham, Jocelyn Hui Lin Goh, Yangqin Feng, Yang Bai, Zhi Da Soh, Rick Siow Mong Goh, Xinxing Xu, Yong Liu, Ching-Yu Cheng

    Abstract: Community vision screening plays a crucial role in identifying individuals with vision loss and preventing avoidable blindness, particularly in rural communities where access to eye care services is limited. Currently, there is a pressing need for a simple and efficient process to screen and refer individuals with significant eye disease-related vision loss to tertiary eye care centers for further…

    Submitted 26 October, 2024; originally announced October 2024.

    Comments: 11 pages, 4 figures, published in MICCAI2024 OMIA XI workshop

  17. arXiv:2410.19008  [pdf, other]

    eess.IV cs.AI cs.CV

    Teach Multimodal LLMs to Comprehend Electrocardiographic Images

    Authors: Ruoqi Liu, Yuelin Bai, Xiang Yue, Ping Zhang

    Abstract: The electrocardiogram (ECG) is an essential non-invasive diagnostic tool for assessing cardiac conditions. Existing automatic interpretation methods suffer from limited generalizability, focusing on a narrow range of cardiac conditions, and typically depend on raw physiological signals, which may not be readily available in resource-limited settings where only printed or digital ECG images are acc…

    Submitted 21 October, 2024; originally announced October 2024.

  18. arXiv:2410.17814  [pdf, other]

    eess.IV cs.CV cs.LG

    Learning Lossless Compression for High Bit-Depth Volumetric Medical Image

    Authors: Kai Wang, Yuanchao Bai, Daxin Li, Deming Zhai, Junjun Jiang, Xianming Liu

    Abstract: Recent advances in learning-based methods have markedly enhanced the capabilities of image compression. However, these methods struggle with high bit-depth volumetric medical images, facing issues such as degraded performance, increased memory demand, and reduced processing speed. To address these challenges, this paper presents the Bit-Division based Lossless Volumetric Image Compression (BD-LVIC…

    Submitted 23 October, 2024; originally announced October 2024.

    Comments: 13 pages

  19. arXiv:2409.12724  [pdf, other]

    cs.CV eess.IV

    PVContext: Hybrid Context Model for Point Cloud Compression

    Authors: Guoqing Zhang, Wenbo Zhao, Jian Liu, Yuanchao Bai, Junjun Jiang, Xianming Liu

    Abstract: Efficient storage of large-scale point cloud data has become increasingly challenging due to advancements in scanning technology. Recent deep learning techniques have revolutionized this field; however, most existing approaches rely on single-modality contexts, such as octree nodes or voxel occupancy, limiting their ability to capture information across large regions. In this paper, we propose PVC…

    Submitted 19 September, 2024; originally announced September 2024.

  20. arXiv:2409.09214  [pdf, other]

    cs.SD eess.AS

    Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

    Authors: Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, et al. (13 additional authors not shown)

    Abstract: We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music gene…

    Submitted 19 September, 2024; v1 submitted 13 September, 2024; originally announced September 2024.

    Comments: Seed-Music technical report, 20 pages, 5 figures

  21. arXiv:2409.08680  [pdf, other]

    eess.AS cs.AI cs.CL

    NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training

    Authors: Minglun Han, Ye Bai, Chen Shen, Youjia Huang, Mingkun Huang, Zehua Lin, Linhao Dong, Lu Lu, Yuxuan Wang

    Abstract: Speech self-supervised pre-training can effectively improve the performance of downstream tasks. However, previous self-supervised learning (SSL) methods for speech, such as HuBERT and BEST-RQ, focus on utilizing non-causal encoders with bidirectional context, and lack sufficient support for downstream streaming models. To address this issue, we introduce the next token prediction based speech pre…

    Submitted 13 September, 2024; originally announced September 2024.

    Comments: 5 pages, 2 figures, Work in progress

  22. Safety Critical Control for Nonlinear Systems with Complex Input Constraints

    Authors: Yaosheng Deng, Yang Bai, Yujie Wang, Masaki Ogura, Mir Feroskhan

    Abstract: In this paper, we propose a novel Control Barrier Function (CBF) based controller for nonlinear systems with complex, time-varying input constraints. To deal with these constraints, we introduce an auxiliary control input to transform the original system into an augmented one, thus reformulating the constrained-input problem into a constrained-output one. This transformation simplifies the Quadrat…

    Submitted 18 May, 2025; v1 submitted 18 August, 2024; originally announced August 2024.

    Comments: 8 pages, 2 figures

  23. arXiv:2407.21391  [pdf]

    cs.SD cs.CV cs.MM eess.AS

    Design and Development of Laughter Recognition System Based on Multimodal Fusion and Deep Learning

    Authors: Fuzheng Zhao, Yu Bai

    Abstract: This study aims to design and implement a laughter recognition system based on multimodal fusion and deep learning, leveraging image and audio processing technologies to achieve accurate laughter recognition and emotion analysis. First, the system loads video files and uses the OpenCV library to extract facial information while employing the Librosa library to process audio features such as MFCC.…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: 7 pages, 2 figures

  24. arXiv:2407.10108  [pdf, other]

    eess.AS cs.SD

    Advancing Continual Learning for Robust Deepfake Audio Classification

    Authors: Feiyi Dong, Qingchen Tang, Yichen Bai, Zihan Wang

    Abstract: The emergence of new spoofing attacks poses an increasing challenge to audio security. Current detection methods often falter when faced with unseen spoofing attacks. Traditional strategies, such as retraining with new data, are not always feasible due to extensive storage. This paper introduces a novel continual learning method Continual Audio Defense Enhancer (CADE). First, by utilizing a fixed…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: Submitted to IEEE Tencon. 5 pages

  25. arXiv:2407.04675  [pdf, other]

    eess.AS cs.SD

    Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

    Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, et al. (30 additional authors not shown)

    Abstract: Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this wor…

    Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  26. arXiv:2406.18558  [pdf, other]

    cs.CV eess.IV

    BAISeg: Boundary Assisted Weakly Supervised Instance Segmentation

    Authors: Tengbo Wang, Yu Bai

    Abstract: How to extract instance-level masks without instance-level supervision is the main challenge of weakly supervised instance segmentation (WSIS). Popular WSIS methods estimate a displacement field (DF) via learning inter-pixel relations and perform clustering to identify instances. However, the resulting instance centroids are inherently unstable and vary significantly across different clustering al…

    Submitted 19 November, 2024; v1 submitted 27 May, 2024; originally announced June 2024.

  27. arXiv:2406.04840  [pdf, other]

    cs.SD eess.AS

    TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

    Authors: Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

    Abstract: Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these…

    Submitted 15 November, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  28. arXiv:2405.18435  [pdf, other]

    eess.IV cs.CV

    QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge

    Authors: Hongwei Bran Li, Fernando Navarro, Ivan Ezhov, Amirhossein Bayat, Dhritiman Das, Florian Kofler, Suprosanna Shit, Diana Waldmannstetter, Johannes C. Paetzold, Xiaobin Hu, Benedikt Wiestler, Lucas Zimmer, Tamaz Amiranashvili, Chinmay Prabhakar, Christoph Berger, Jonas Weidner, Michelle Alonso-Basant, Arif Rashid, Ujjwal Baid, Wesam Adel, Deniz Ali, Bhakti Baheti, Yingbin Bai, Ishaan Bhatt, Sabri Can Cetindag, et al. (55 additional authors not shown)

    Abstract: Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the de…

    Submitted 24 June, 2024; v1 submitted 19 March, 2024; originally announced May 2024.

    Comments: initial technical report

  29. GroupedMixer: An Entropy Model with Group-wise Token-Mixers for Learned Image Compression

    Authors: Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu, Wen Gao

    Abstract: Transformer-based entropy models have gained prominence in recent years due to their superior ability to capture long-range dependencies in probability distribution estimation compared to convolution-based methods. However, previous transformer-based entropy models suffer from a sluggish coding process due to pixel-wise autoregression or duplicated computation during inference. In this paper, we p…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Accepted by IEEE TCSVT

  30. arXiv:2404.11275  [pdf, other]

    cs.SD eess.AS

    Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

    Authors: Ye Bai, Chenxing Li, Hao Li, Yuanyuan Zhao, Xiaorui Wang

    Abstract: In short video and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: Accepted by ICME 2024

  31. arXiv:2404.06393  [pdf, other]

    cs.SD cs.AI eess.AS

    MuPT: A Generative Symbolic Music Pretrained Transformer

    Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, et al. (3 additional authors not shown)

    Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the chal…

    Submitted 5 November, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  32. arXiv:2403.17392  [pdf, other]

    cs.RO eess.SY nlin.AO

    Swarm navigation of cyborg-insects in unknown obstructed soft terrain

    Authors: Yang Bai, Phuoc Thanh Tran Ngoc, Huu Duoc Nguyen, Duc Long Le, Quang Huy Ha, Kazuki Kai, Yu Xiang See To, Yaosheng Deng, Jie Song, Naoki Wakamiya, Hirotaka Sato, Masaki Ogura

    Abstract: Cyborg insects refer to hybrid robots that integrate living insects with miniature electronic controllers to enable robotic-like programmable control. These creatures exhibit advantages over conventional robots in adaption to complex terrain and sustained energy efficiency. Nevertheless, there is a lack of literature on the control of multi-cyborg systems. This research gap is due to the difficult…

    Submitted 21 December, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

  33. arXiv:2403.10585  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint

    Authors: Haoyue Tang, Tian Xie, Aosong Feng, Hanyu Wang, Chenyang Zhang, Yang Bai

    Abstract: Solving image inverse problems (e.g., super-resolution and inpainting) requires generating a high fidelity image that matches the given input (the low-resolution image or the masked image). By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task specific model fine-tuning. To precisely estimate the guid…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted and to Appear, AISTATS 2024

  34. arXiv:2401.14007  [pdf, other]

    eess.IV cs.CV

    Semantic Ensemble Loss and Latent Refinement for High-Fidelity Neural Image Compression

    Authors: Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu

    Abstract: Recent advancements in neural compression have surpassed traditional codecs in PSNR and MS-SSIM measurements. However, at low bit-rates, these methods can introduce visually displeasing artifacts, such as blurring, color shifting, and texture loss, thereby compromising perceptual quality of images. To address these issues, this study presents an enhanced neural compression method designed for opti…

    Submitted 25 October, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: Accepted by VCIP 2024

  35. Diffusion-Based Adversarial Purification for Speaker Verification

    Authors: Yibo Bai, Xiao-Lei Zhang, Xuelong Li

    Abstract: Recently, automatic speaker verification (ASV) based on deep learning is easily contaminated by adversarial attacks, which is a new type of attack that injects imperceptible perturbations to audio signals so as to make ASV produce wrong decisions. This poses a significant threat to the security and reliability of ASV systems. To address this issue, we propose a Diffusion-Based Adversarial Purifica…

    Submitted 9 July, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

    Comments: Accepted by IEEE Signal Processing Letters

  36. arXiv:2309.10740  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

    Authors: Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

    Abstract: Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve so by pro…

    Submitted 24 June, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

  37. arXiv:2309.05908  [pdf, other]

    eess.SY

    Reset Controller Synthesis by Reach-avoid Analysis for Delay Hybrid Systems

    Authors: Han Su, Jiyu Zhu, Shenghua Feng, Yunjun Bai, Bin Gu, Jiang Liu, Mengfei Yang, Naijun Zhan

    Abstract: A reset controller plays a crucial role in designing hybrid systems. It restricts the initial set and redefines the reset map associated with discrete transitions, in order to guarantee the system to achieve its objective. Reset controller synthesis, together with feedback controller synthesis and switching logic controller synthesis, provides a correct-by-construction approach to designing hybrid…

    Submitted 27 May, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: 15 pages, 10 figures

  38. arXiv:2309.05906  [pdf, other]

    eess.SY

    Correct-by-Construction for Hybrid Systems by Synthesizing Reset Controller

    Authors: Jiang Liu, Han Su, Yunjun Bai, Bin Gu, Bai Xue, Mengfei Yang, Naijun Zhan

    Abstract: Controller synthesis, including reset controller, feedback controller, and switching logic controller, provides an essential mechanism to guarantee the correctness and reliability of hybrid systems in a correct-by-construction manner. Unfortunately, reset controller synthesis is still in an infant stage in the literature, although it makes theoretical and practical significance. In this paper, we…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: 26 pages, 8 figures

  39. arXiv:2307.15980  [pdf, other]

    cs.LG eess.SY

    Initial State Interventions for Deconfounded Imitation Learning

    Authors: Samuel Pfrommer, Yatong Bai, Hyunin Lee, Somayeh Sojoudi

    Abstract: Imitation learning suffers from causal confusion. This phenomenon occurs when learned policies attend to features that do not causally influence the expert actions but are instead spuriously correlated. Causally confused agents produce low open-loop supervised loss but poor closed-loop performance upon deployment. We consider the problem of masking observed confounders in a disentangled representa…

    Submitted 11 August, 2023; v1 submitted 29 July, 2023; originally announced July 2023.

    Comments: 62nd IEEE Conference on Decision and Control

  40. arXiv:2306.16710  [pdf]

    cs.CL cs.SD eess.AS eess.SP

    Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

    Authors: Simone Wills, Yu Bai, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

    Abstract: Voicebots have provided a new avenue for supporting the development of language skills, particularly within the context of second language learning. Voicebots, though, have largely been geared towards native adult speakers. We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a f…

    Submitted 29 June, 2023; originally announced June 2023.

    Comments: Published on SLATE 2023, Esmad, Politecnico Do Porto, Portugal, 26-28 June, 2023, pp: 11:1-11:8

    Journal ref: 12th Symposium on Languages, Applications and Technologies (SLATE 2023) (p. 7:1-7:8)

  41. Visual-Aware Text-to-Speech

    Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

    Abstract: Dynamically synthesizing talking speech that actively responds to a listening head is critical during the face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and s…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: accepted as oral and top 3% paper by ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023, 1-5

  42. arXiv:2306.04190  [pdf]

    cs.CL cs.LG cs.SD eess.AS eess.SP

    An ASR-Based Tutor for Learning to Read: How to Optimize Feedback to First Graders

    Authors: Yu Bai, Cristian Tejedor-Garcia, Ferdy Hubers, Catia Cucchiarini, Helmer Strik

    Abstract: The interest in employing automatic speech recognition (ASR) in applications for reading practice has been growing in recent years. In a previous study, we presented an ASR-based Dutch reading tutor application that was developed to provide instantaneous feedback to first-graders learning to read. We saw that ASR has potential at this stage of the reading process, as the results suggested that pup…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: Published (double-blind peer-reviewed) on SPECOM 2021

    Journal ref: In: Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham

  43. arXiv:2306.02982  [pdf, other]

    cs.CL eess.AS

    PolyVoice: Language Models for Speech to Speech Translation

    Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

    Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) system. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt…

    Submitted 13 June, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

  44. arXiv:2306.01232  [pdf, other]

    eess.IV cs.CV

    Deep Reinforcement Learning Framework for Thoracic Diseases Classification via Prior Knowledge Guidance

    Authors: Weizhi Nie, Chen Zhang, Dan Song, Lina Zhao, Yunpeng Bai, Keliang Xie, Anan Liu

    Abstract: The chest X-ray is often utilized for diagnosing common thoracic diseases. In recent years, many approaches have been proposed to handle the problem of automatic diagnosis based on chest X-rays. However, the scarcity of labeled data for related diseases still poses a huge challenge to an accurate diagnosis. In this paper, we focus on the thorax disease diagnostic problem and propose a novel deep r…

    Submitted 1 June, 2023; originally announced June 2023.

  45. arXiv:2305.12072  [pdf, other]

    eess.IV cs.CV

    Chest X-ray Image Classification: A Causal Perspective

    Authors: Weizhi Nie, Chen Zhang, Dan Song, Lina Zhao, Yunpeng Bai, Keliang Xie, Anan Liu

    Abstract: The chest X-ray (CXR) is one of the most common and easy-to-get medical tests used to diagnose common diseases of the chest. Recently, many deep learning-based methods have been proposed that are capable of effectively classifying CXRs. Even though these techniques have worked quite well, it is difficult to establish whether what these algorithms actually learn is the cause-and-effect link between…

    Submitted 19 May, 2023; originally announced May 2023.

  46. arXiv:2305.12070  [pdf, other]

    eess.IV cs.CV

    Instrumental Variable Learning for Chest X-ray Classification

    Authors: Weizhi Nie, Chen Zhang, Dan Song, Yunpeng Bai, Keliang Xie, Anan Liu

    Abstract: The chest X-ray (CXR) is commonly employed to diagnose thoracic illnesses, but the challenge of achieving accurate automatic diagnosis through this method persists due to the complex relationship between pathology. In recent years, various deep learning-based approaches have been suggested to tackle this problem but confounding factors such as image resolution or noise problems often damage model…

    Submitted 19 May, 2023; originally announced May 2023.

  47. arXiv:2305.07278  [pdf, ps, other]

    eess.SP

    Deep Learning for Asynchronous Massive Access with Data Frame Length Diversity

    Authors: Yanna Bai, Wei Chen, Bo Ai, Petar Popovski

    Abstract: Grant-free non-orthogonal multiple access has been regarded as a viable approach to accommodate access for a massive number of machine-type devices with small data packets. The sporadic activation of the devices creates a multiuser setup where it is suitable to use compressed sensing in order to detect the active devices and decode their data. We consider asynchronous access of machine-type device…

    Submitted 12 May, 2023; originally announced May 2023.

  48. A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU

    Authors: Wenqian Zhao, Qi Sun, Yang Bai, Wenbo Li, Haisheng Zheng, Bei Yu, Martin D. F. Wong

    Abstract: Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement sets a challenge not only for the model design but also for the on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. The special dictionary learning algorithm used in SR models was analyzed in detail and acc…

    Submitted 15 March, 2023; originally announced March 2023.

  49. arXiv:2301.12048  [pdf, other]

    cs.CV eess.IV

    Making Reconstruction-based Method Great Again for Video Anomaly Detection

    Authors: Yizhou Wang, Can Qin, Yue Bai, Yi Xu, Xu Ma, Yun Fu

    Abstract: Anomaly detection in videos is a significant yet challenging problem. Previous approaches based on deep neural networks employ either reconstruction-based or prediction-based approaches. Nevertheless, existing reconstruction-based methods 1) rely on old-fashioned convolutional autoencoders and are poor at modeling temporal dependency; 2) are prone to overfit the training samples, leading to indist…

    Submitted 27 January, 2023; originally announced January 2023.

    Comments: Accepted by ICDM 2022

  50. arXiv:2301.10314  [pdf, other]

    cs.HC cs.SD eess.AS

    WhisperWand: Simultaneous Voice and Gesture Tracking Interface

    Authors: Yang Bai, Irtaza Shahid, Harshvardhan Takawale, Nirupam Roy

    Abstract: This paper presents the design and implementation of WhisperWand, a comprehensive voice and motion tracking interface for voice assistants. Distinct from prior works, WhisperWand is a precise tracking interface that can co-exist with the voice interface on low sampling rate voice assistants. Taking handwriting as a specific application, it can also capture natural strokes and the individualized st…

    Submitted 24 January, 2023; originally announced January 2023.