Showing 1–50 of 274 results for author: Yu, Y

Searching in archive eess.
  1. arXiv:2506.18067  [pdf, ps, other]

    eess.SP cs.IT

    Cooperative Bistatic ISAC Systems for Low-Altitude Economy

    Authors: Zhenkun Zhang, Yining Xu, Cunhua Pan, Hong Ren, Yiming Yu, Jiangzhou Wang

    Abstract: The burgeoning low-altitude economy (LAE) necessitates integrated sensing and communication (ISAC) systems capable of high-accuracy multi-target localization and velocity estimation under hardware and coverage constraints inherent in conventional ISAC architectures. This paper addresses these challenges by proposing a cooperative bistatic ISAC framework within MIMO-OFDM cellular networks, enabling…

    Submitted 22 June, 2025; originally announced June 2025.

  2. arXiv:2506.08967  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

    Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, et al. (51 additional authors not shown)

    Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du…

    Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: 12 pages, 3 figures

  3. arXiv:2506.01759  [pdf, ps, other]

    cs.RO eess.SY

    ADEPT: Adaptive Diffusion Environment for Policy Transfer Sim-to-Real

    Authors: Youwei Yu, Junhong Xu, Lantao Liu

    Abstract: Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured environments. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently…

    Submitted 4 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2410.10766

  4. arXiv:2505.18784  [pdf, other]

    eess.IV cond-mat.mtrl-sci cs.LG

    A physics-guided smoothing method for material modeling with digital image correlation (DIC) measurements

    Authors: Jihong Wang, Chung-Hao Lee, William Richardson, Yue Yu

    Abstract: In this work, we present a novel approach to process the DIC measurements of multiple biaxial stretching protocols. In particular, we develop an optimization-based approach, which calculates the smoothed nodal displacements using a moving least-squares algorithm subject to positive strain constraints. As such, physically consistent displacement and strain fields are obtained. Then, we further deplo…

    Submitted 24 May, 2025; originally announced May 2025.

  5. arXiv:2505.18644  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving

    Authors: Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu

    Abstract: Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these iss…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  6. arXiv:2505.18614  [pdf, ps, other]

    cs.CL cs.LG cs.MM cs.SD eess.AS

    MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

    Authors: Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu

    Abstract: Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation.…

    Submitted 5 June, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

    Comments: 28 pages, 8 figures, our codes and datasets are available at https://github.com/k1064190/MAVL

  7. arXiv:2505.15868  [pdf]

    q-bio.QM cs.AI eess.IV

    An Inclusive Foundation Model for Generalizable Cytogenetics in Precision Oncology

    Authors: Changchun Yang, Weiqian Dai, Yilan Zhang, Siyuan Chen, Jingdong Hu, Junkai Su, Yuxuan Chen, Ao Xu, Na Li, Xin Gao, Yongguo Yu

    Abstract: Chromosome analysis is vital for diagnosing genetic disorders and guiding cancer therapy decisions through the identification of somatic clonal aberrations. However, developing an AI model is hindered by the overwhelming complexity and diversity of chromosomal abnormalities, requiring extensive annotation efforts, while automated methods remain task-specific and lack generalizability due to the s…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: These authors contributed equally to this work: Changchun Yang, Weiqian Dai, Yilan Zhang

  8. arXiv:2505.14351  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

    Authors: Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi

    Abstract: Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: 13 pages

  9. arXiv:2505.08681  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    A Mamba-based Network for Semi-supervised Singing Melody Extraction Using Confidence Binary Regularization

    Authors: Xiaoliang He, Kangjie Dong, Jingkai Cao, Shuai Yu, Wei Li, Yi Yu

    Abstract: Singing melody extraction (SME) is a key task in the field of music information retrieval. However, existing methods face several limitations. Firstly, prior models use transformers to capture the contextual dependencies, which requires quadratic computation, resulting in low efficiency in the inference stage. Secondly, prior works typically rely on frequency-supervised methods to estimate the…

    Submitted 13 May, 2025; originally announced May 2025.

  10. arXiv:2505.06025  [pdf, ps, other]

    cs.NI cs.DC eess.SY

    Efficient Information Updates in Compute-First Networking via Reinforcement Learning with Joint AoI and VoI

    Authors: Jianpeng Qi, Chao Liu, Chengxiang Xu, Rui Wang, Junyu Dong, Yanwei Yu

    Abstract: Timely and efficient dissemination of service information is critical in compute-first networking systems, where user requests arrive dynamically and computing resources are constrained. In such systems, the access point (AP) plays a key role in forwarding user requests to a server based on its latest received service information. This paper considers a single-source, single-destination system and…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: 11 pages, 40 figures

  11. arXiv:2505.00265  [pdf, other]

    cs.LG eess.IV

    Field-scale soil moisture estimated from Sentinel-1 SAR data using a knowledge-guided deep learning approach

    Authors: Yi Yu, Patrick Filippi, Thomas F. A. Bishop

    Abstract: Soil moisture (SM) estimation from active microwave data remains challenging due to the complex interactions between radar backscatter and surface characteristics. While the water cloud model (WCM) provides a semi-physical approach for understanding these interactions, its empirical component often limits performance across diverse agricultural landscapes. This research presents preliminary effort…

    Submitted 30 April, 2025; originally announced May 2025.

    Comments: Accepted by the 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025)

  12. arXiv:2504.10686  [pdf, other]

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  13. arXiv:2504.09225  [pdf, other]

    cs.SD cs.AI eess.AS

    AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis

    Authors: Yubing Cao, Yinfeng Yu, Yongming Li, Liejun Wang

    Abstract: This paper presents AMNet, an Acoustic Model Network designed to improve the performance of Mandarin speech synthesis by incorporating phrase structure annotation and local convolution modules. AMNet builds upon the FastSpeech 2 architecture while addressing the challenge of local context modeling, which is crucial for capturing intricate speech features such as pauses, stress, and intonation. By…

    Submitted 12 April, 2025; originally announced April 2025.

    Comments: Main paper (8 pages). Accepted for publication by IJCNN 2025

  14. arXiv:2504.08604  [pdf, other]

    cs.RO cs.AI cs.LG eess.SY

    Neural Fidelity Calibration for Informative Sim-to-Real Adaptation

    Authors: Youwei Yu, Lantao Liu

    Abstract: Deep reinforcement learning can seamlessly transfer agile locomotion and navigation skills from the simulator to the real world. However, bridging the sim-to-real gap with domain randomization or adversarial methods often demands expert physics knowledge to ensure policy robustness. Even so, cutting-edge simulators may fall short of capturing every real-world detail, and the reconstructed environment…

    Submitted 11 April, 2025; originally announced April 2025.

  15. arXiv:2504.05158  [pdf, other]

    cs.SD cs.AI eess.AS

    Leveraging Label Potential for Enhanced Multimodal Emotion Recognition

    Authors: Xuechun Shao, Yinfeng Yu, Liejun Wang

    Abstract: Multimodal emotion recognition (MER) seeks to integrate various modalities to predict emotional states accurately. However, most current research focuses solely on the fusion of audio and text features, overlooking the valuable information in emotion labels. This oversight could potentially hinder the performance of existing methods, as emotion labels harbor rich, insightful information that could…

    Submitted 7 April, 2025; originally announced April 2025.

    Comments: Main paper (8 pages). Accepted for publication by IJCNN 2025

  16. arXiv:2504.04765  [pdf, other]

    eess.SY

    Multi-Agent Deep Reinforcement Learning for Multiple Anesthetics Collaborative Control

    Authors: Huijie Li, Yide Yu, Si Shi, Anmin Hu, Jian Huo, Wei Lin, Chaoran Wu, Wuman Luo

    Abstract: Automated control of personalized multiple anesthetics in clinical Total Intravenous Anesthesia (TIVA) is crucial yet challenging. Current systems, including target-controlled infusion (TCI) and closed-loop systems, either rely on relatively static pharmacokinetic/pharmacodynamic (PK/PD) models or focus on single anesthetic control, limiting personalization and collaborative control. To address th…

    Submitted 7 April, 2025; originally announced April 2025.

  17. arXiv:2504.03687  [pdf, other]

    eess.SP cs.AI cs.CV

    Process Optimization and Deployment for Sensor-Based Human Activity Recognition Based on Deep Learning

    Authors: Hanyu Liu, Ying Yu, Hang Xiao, Siyao Li, Xuze Li, Jiarui Li, Haotian Tang

    Abstract: Sensor-based human activity recognition is a key technology for many human-centered intelligent applications. However, this research is still in its infancy and faces many unresolved challenges. To address these, we propose a comprehensive optimization process approach centered on multi-attention interaction. We first utilize unsupervised statistical feature-guided diffusion models for highly adap…

    Submitted 22 March, 2025; originally announced April 2025.

  18. arXiv:2503.23108  [pdf, other]

    eess.AS cs.LG cs.SD

    SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System

    Authors: Hyeongju Kim, Jinhyeok Yang, Yechan Yu, Seunghun Ji, Jacob Morton, Frederik Bous, Joon Byun, Juheon Lee

    Abstract: We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ…

    Submitted 16 May, 2025; v1 submitted 29 March, 2025; originally announced March 2025.

    Comments: 21 pages, preprint

  19. arXiv:2503.21571  [pdf, other]

    cs.SD cs.AI eess.AS

    Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch Boosting

    Authors: Alimjan Mattursun, Liejun Wang, Yinfeng Yu, Chunyang Ma

    Abstract: Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the mag…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Main paper (6 pages). Accepted for publication by ICME 2025

  20. arXiv:2503.14927  [pdf, other]

    cs.LG eess.SY math.DS

    Semi-Gradient SARSA Routing with Theoretical Guarantee on Traffic Stability and Weight Convergence

    Authors: Yidan Wu, Yu Yu, Jianan Zhang, Li Jin

    Abstract: We consider the traffic control problem of dynamic routing over parallel servers, which arises in a variety of engineering systems such as transportation and data transmission. We propose a semi-gradient, on-policy algorithm that learns an approximate optimal routing policy. The algorithm uses generic basis functions with flexible weights to approximate the value function across the unbounded stat…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: arXiv admin note: text overlap with arXiv:2404.09188

  21. arXiv:2503.14573  [pdf]

    eess.IV cs.CV cs.GR

    Submillimeter-Accurate 3D Lumbar Spine Reconstruction from Biplanar X-Ray Images: Incorporating a Multi-Task Network and Landmark-Weighted Loss

    Authors: Wanxin Yu, Zhemin Zhu, Cong Wang, Yihang Bao, Chunjie Xia, Rongshan Cheng, Yan Yu, Tsung-Yuan Tsai

    Abstract: Three-dimensional reconstruction of the spine under weight-bearing conditions from biplanar X-ray images is of great importance for the clinical assessment of spinal diseases. However, the current fully automated reconstruction methods only achieve millimeter-level accuracy, making it difficult to meet clinical standards. This study developed and validated a fully automated method for high-accurac…

    Submitted 18 May, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

    Comments: 24 pages, 11 figures, 5 tables

  22. arXiv:2503.06651  [pdf, other]

    cs.IT eess.SP

    Electromagnetic Information Theory: Fundamentals, Paradigm Shifts, and Applications

    Authors: Tengjiao Wang, Zhenyu Kang, Ting Li, Zhihui Chen, Shaobo Wang, Yingpei Lin, Yan Wang, Yichuan Yu

    Abstract: This paper explores the emerging research direction of electromagnetic information theory (EIT), which aims to integrate traditional Shannon-based methodologies with physical consistency, particularly the electromagnetic properties of communication channels. We propose an EIT-based multiple-input multiple-output (MIMO) paradigm that enhances conventional spatially-discrete MIMO models by incorpora…

    Submitted 9 March, 2025; originally announced March 2025.

  23. arXiv:2502.19548  [pdf, other]

    cs.CL cs.SD eess.AS

    When Large Language Models Meet Speech: A Survey on Integration Approaches

    Authors: Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, Chenhui Chu

    Abstract: Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based,…

    Submitted 26 February, 2025; originally announced February 2025.

  24. arXiv:2502.15006  [pdf, ps, other]

    cs.RO cs.AI eess.SY

    Safe Beyond the Horizon: Efficient Sampling-based MPC with Neural Control Barrier Functions

    Authors: Ji Yin, Oswin So, Eric Yang Yu, Chuchu Fan, Panagiotis Tsiotras

    Abstract: A common problem when using model predictive control (MPC) in practice is the satisfaction of safety specifications beyond the prediction horizon. While theoretical works have shown that safety can be guaranteed by enforcing a suitable terminal set constraint or a sufficiently long prediction horizon, these techniques are difficult to apply and thus are rarely used by practitioners, especially in…

    Submitted 8 July, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

    Comments: Accepted by RSS 2025

  25. arXiv:2502.11946  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, et al. (120 additional authors not shown)

    Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contribu…

    Submitted 18 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

  26. arXiv:2501.10937  [pdf, other]

    cs.CL cs.SD eess.AS

    Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data

    Authors: Jingran Xie, Shun Lei, Yue Yu, Yang Xiang, Hui Wang, Xixin Wu, Zhiyong Wu

    Abstract: Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation by harnessing their powerful capabilities and shown its potential in multimodal domains. Many studies have…

    Submitted 18 January, 2025; originally announced January 2025.

    Comments: Accepted by ICASSP 2025

  27. arXiv:2501.07459  [pdf, other]

    eess.SP

    SynthSoM: A synthetic intelligent multi-modal sensing-communication dataset for Synesthesia of Machines (SoM)

    Authors: Xiang Cheng, Ziwei Huang, Yong Yu, Lu Bai, Mingran Sun, Zengrui Han, Ruide Zhang, Sijiang Li

    Abstract: Given the importance of datasets for sensing-communication integration research, a novel simulation platform for constructing communication and multi-modal sensory datasets is developed. The developed platform integrates three high-precision software tools, i.e., AirSim, WaveFarer, and Wireless InSite, and further achieves in-depth integration and precise alignment of them. Based on the developed platfor…

    Submitted 24 April, 2025; v1 submitted 13 January, 2025; originally announced January 2025.

  28. arXiv:2501.05474  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Modality-Invariant Bidirectional Temporal Representation Distillation Network for Missing Multimodal Sentiment Analysis

    Authors: Xincheng Wang, Liejun Wang, Yinfeng Yu, Xinxin Jiao

    Abstract: Multimodal Sentiment Analysis (MSA) integrates diverse modalities (text, audio, and video) to comprehensively analyze and understand individuals' emotional states. However, the real-world prevalence of incomplete data poses significant challenges to MSA, mainly due to the randomness of modality missing. Moreover, the heterogeneity issue in multimodal data has yet to be effectively addressed. To tac…

    Submitted 7 January, 2025; originally announced January 2025.

    Comments: Accepted for publication by 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

  29. arXiv:2412.20371  [pdf, other]

    eess.SP

    Cooperative ISAC-empowered Low-Altitude Economy

    Authors: Jun Tang, Yiming Yu, Cunhua Pan, Hong Ren, Dongming Wang, Jiangzhou Wang, Xiaohu You

    Abstract: This paper proposes a cooperative integrated sensing and communication (ISAC) scheme for the low-altitude sensing scenario, aiming at estimating the parameters of the unmanned aerial vehicles (UAVs) and enhancing the sensing performance via cooperation. The proposed scheme consists of two stages. In Stage I, we formulate the monostatic parameter estimation problem using a tensor decomposition…

    Submitted 29 December, 2024; originally announced December 2024.

  30. arXiv:2412.19967  [pdf, other]

    cs.LG cs.AI eess.SP

    MobileNetV2: A lightweight classification model for home-based sleep apnea screening

    Authors: Hui Pan, Yanxuan Yu, Jilun Ye, Xu Zhang

    Abstract: This study proposes a novel lightweight neural network model leveraging features extracted from electrocardiogram (ECG) and respiratory signals for early OSA screening. ECG signals are used to generate feature spectrograms to predict sleep stages, while respiratory signals are employed to detect sleep-related breathing abnormalities. By integrating these predictions, the method calculates the apne…

    Submitted 3 January, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

  31. arXiv:2412.14614  [pdf, other]

    eess.SY

    A Model-free Biomimetics Algorithm for Deterministic Partially Observable Markov Decision Process

    Authors: Yide Yu, Yue Liu, Xiaochen Yuan, Dennis Wong, Huijie Li, Yan Ma

    Abstract: Partially Observable Markov Decision Process (POMDP) is a mathematical framework for modeling decision-making under uncertainty, where the agent's observations are incomplete and the underlying system dynamics are probabilistic. Solving the POMDP problem within the model-free paradigm is challenging for agents due to the inherent difficulty in accurately identifying and distinguishing between stat…

    Submitted 19 December, 2024; originally announced December 2024.

    Comments: 27 pages, 5 figures

  32. arXiv:2412.07387  [pdf, other]

    eess.IV cs.AI cs.CV

    Enhanced MRI Representation via Cross-series Masking

    Authors: Churan Wang, Fei Gao, Lijun Yan, Siwen Wang, Yizhou Yu, Yizhou Wang

    Abstract: Magnetic resonance imaging (MRI) is indispensable for diagnosing and planning treatment in various medical conditions due to its ability to produce multi-series images that reveal different tissue characteristics. However, integrating these diverse series to form a coherent analysis presents significant challenges, such as differing spatial resolutions and contrast patterns meanwhile requiring ext…

    Submitted 10 December, 2024; originally announced December 2024.

  33. arXiv:2412.00150  [pdf, other]

    cs.CV eess.IV

    Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise

    Authors: Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, Kyoobin Lee

    Abstract: Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that…

    Submitted 29 November, 2024; originally announced December 2024.

    Comments: Accepted at NeurIPS 2024

  34. arXiv:2412.00049  [pdf, other]

    cs.MM cs.AI cs.CV cs.SD eess.AS

    A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning

    Authors: Luis Vilaca, Yi Yu, Paula Vinan

    Abstract: Audio-visual correlation learning aims to capture and understand natural phenomena between audio and visual data. The rapid growth of Deep Learning propelled the development of proposals that process audio-visual data, as can be observed in the number of proposals in the past years, thus encouraging the development of a comprehensive survey. Besides analyzing the models used in this context, we al…

    Submitted 23 November, 2024; originally announced December 2024.

    Comments: arXiv admin note: text overlap with arXiv:2202.13673

  35. arXiv:2411.11320  [pdf, other]

    eess.SP

    Robust and Constrained Estimation of State-Space Models: A Majorization-Minimization Approach

    Authors: Yifan Yu, Shengjie Xiu, Daniel P. Palomar

    Abstract: In this paper, we present a novel optimization algorithm designed specifically for estimating state-space models to deal with heavy-tailed measurement noise and constraints. Our algorithm addresses two significant limitations found in existing approaches: susceptibility to measurement noise outliers and difficulties in incorporating constraints into state estimation. By formulating constrained sta…

    Submitted 18 November, 2024; originally announced November 2024.

    Comments: 6 pages, 5 figures. This work has been accepted by and presented at The Asilomar Conference on Signals, Systems, and Computers, Oct. 2024

  36. arXiv:2411.06983  [pdf, other]

    eess.SP

    Sensing Capacity for Integrated Sensing and Communication Systems in Low-Altitude Economy

    Authors: Jiahua Wan, Hong Ren, Cunhua Pan, Zhenkun Zhang, Songtao Gao, Yiming Yu, Chengzhong Wang

    Abstract: The burgeoning significance of the low-altitude economy (LAE) has garnered considerable interest, largely fuelled by the widespread deployment of unmanned aerial vehicles (UAVs). To tackle the challenges associated with the detection of unauthorized UAVs and the efficient scheduling of authorized UAVs, this letter introduces a novel performance metric, termed sensing capacity, for integrated sensi…

    Submitted 11 November, 2024; originally announced November 2024.

  37. arXiv:2410.20691  [pdf, ps, other]

    cs.NI cs.LG eess.SP

    Wireless-Friendly Window Position Optimization for RIS-Aided Outdoor-to-Indoor Networks based on Multi-Modal Large Language Model

    Authors: Jinbo Hou, Kehai Qiu, Zitian Zhang, Yong Yu, Kezhi Wang, Stefano Capolongo, Jiliang Zhang, Zeyang Li, Jie Zhang

    Abstract: This paper aims to simultaneously optimize indoor wireless and daylight performance by adjusting the positions of windows and the beam directions of window-deployed reconfigurable intelligent surfaces (RISs) for RIS-aided outdoor-to-indoor (O2I) networks utilizing large language models (LLM) as optimizers. Firstly, we illustrate the wireless and daylight system models of RIS-aided O2I networks and…

    Submitted 20 June, 2025; v1 submitted 7 October, 2024; originally announced October 2024.

  38. arXiv:2409.15744  [pdf, other]

    eess.IV cs.CV

    ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features

    Authors: Xin Wei, Yaling Tao, Changde Du, Gangming Zhao, Yizhou Yu, Jinpeng Li

    Abstract: Mammography is the primary imaging tool for breast cancer diagnosis. Despite significant strides in applying deep learning to interpret mammography images, efforts that focus predominantly on visual features often struggle with generalization across datasets. We hypothesize that integrating additional modalities in the radiology practice, notably the linguistic features of reports and manifestatio…

    Submitted 24 September, 2024; originally announced September 2024.

  39. arXiv:2409.12139  [pdf, other]

    cs.SD cs.AI eess.AS

    Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

    Authors: Sijing Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Yu Pan, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jixun Yao, Quanlei Yan, Yuguang Yang, Jianhao Ye, Jingjing Yin, Yanzhen Yu, Huimin Zhang, Xiang Zhang, Guangcheng Zhao, Hongbin Zhou, Pengpeng Zou

    Abstract: With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-…

    Submitted 23 September, 2024; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: Technical Report; 18 pages; typos corrected, references added, demo url modified, author name modified;

  40. arXiv:2409.07226  [pdf, other]

    cs.SD eess.AS

    Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

    Authors: Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, Qin Jin

    Abstract: This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs and offer significant advantages in versatility and intelligence, supporting multi-format in…

    Submitted 10 October, 2024; v1 submitted 11 September, 2024; originally announced September 2024.

    Comments: Accepted by ACMMM 2024 demo track

  41. arXiv:2409.05904  [pdf

    cs.NI eess.SY

    A Centralized Discovery-Based Method for Integrating Data Distribution Service and Time-Sensitive Networking in In-Vehicle Networks

    Authors: Feng Luo, Yi Ren, Yanhua Yu, Yunpeng Li, Zitong Wang

    Abstract: As the electronic and electrical architecture (E/EA) of intelligent and connected vehicles (ICVs) evolves, traditional distributed and signal-oriented architectures are being replaced by centralized, service-oriented architectures (SOA). This new generation of E/EA demands in-vehicle networks (IVNs) that offer high bandwidth, real-time performance, reliability, and service orientation. Data distribution service…

    Submitted 6 September, 2024; originally announced September 2024.

  42. arXiv:2408.06911  [pdf, other

    eess.AS cs.AI

    Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement

    Authors: Tao Zheng, Liejun Wang, Yinfeng Yu

    Abstract: Self-supervised learning has demonstrated impressive performance in speech tasks, yet there remains ample opportunity for advancement in the realm of speech enhancement research. In addressing speech tasks, confining the attention mechanism solely to the temporal dimension poses limitations in effectively focusing on critical speech features. Considering the aforementioned issues, our study introd…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

  43. arXiv:2408.06906  [pdf, other

    eess.AS cs.AI

    VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

    Authors: Yubing Cao, Yongming Li, Liejun Wang, Yinfeng Yu

    Abstract: Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

  44. arXiv:2408.06851  [pdf, other

    eess.AS cs.AI

    BSS-CFFMA: Cross-Domain Feature Fusion and Multi-Attention Speech Enhancement Network based on Self-Supervised Embedding

    Authors: Alimjan Mattursun, Liejun Wang, Yinfeng Yu

    Abstract: Speech self-supervised learning (SSL) has achieved state-of-the-art (SOTA) performance in multiple downstream tasks. However, its application in speech enhancement (SE) tasks remains immature, offering opportunities for improvement. In this study, we introduce a novel cross-domain feature fusion and multi-attention speech enhancement network, termed BSS-CFFMA, which leverages self-super…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

  45. arXiv:2407.16779  [pdf, other

    eess.SY

    Learning Networked Dynamical System Models with Weak Form and Graph Neural Networks

    Authors: Yin Yu, Daning Huang, Seho Park, Herschel C. Pangborn

    Abstract: This paper presents a sequence of two approaches for the data-driven control-oriented modeling of networked systems, i.e., the systems that involve many interacting dynamical components. First, a novel deep learning approach named the weak Latent Dynamics Model (wLDM) is developed for learning generic nonlinear dynamics with control. Leveraging the weak form, the wLDM enables more numerically stab…

    Submitted 23 July, 2024; originally announced July 2024.

  46. arXiv:2407.12380  [pdf, other

    eess.AS cs.SD

    PCQ: Emotion Recognition in Speech via Progressive Channel Querying

    Authors: Xincheng Wang, Liejun Wang, Yinfeng Yu, Xinxin Jiao

    Abstract: In human-computer interaction (HCI), Speech Emotion Recognition (SER) is a key technology for understanding human intentions and emotions. Traditional SER methods struggle to effectively capture the long-term temporal correlations and dynamic variations in complex emotional expressions. To overcome these limitations, we introduce the PCQ method, a pioneering approach for SER via Progress…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Accepted for publication by International Conference On Intelligent Computing 2024. For data and code, see https://github.com/ICIG/PCQ-Net

  47. arXiv:2407.06525  [pdf, other

    eess.IV cs.CV

    UnmixingSR: Material-aware Network with Unsupervised Unmixing as Auxiliary Task for Hyperspectral Image Super-resolution

    Authors: Yang Yu

    Abstract: Deep learning-based (DL-based) hyperspectral image (HSI) super-resolution (SR) methods have achieved remarkable performance and attracted attention in industry and academia. Nonetheless, most current methods explored and learned the mapping relationship between low-resolution (LR) and high-resolution (HR) HSIs, leading to the side effect of increasing unreliability and irrationality in solving the…

    Submitted 8 July, 2024; originally announced July 2024.

  48. arXiv:2407.01956  [pdf, other

    eess.SY cs.RO

    Cloud-Edge-Terminal Collaborative AIGC for Autonomous Driving

    Authors: Jianan Zhang, Zhiwei Wei, Boxun Liu, Xiayi Wang, Yong Yu, Rongqing Zhang

    Abstract: In dynamic autonomous driving environments, Artificial Intelligence-Generated Content (AIGC) technology can supplement vehicle perception and decision making by leveraging models' generative and predictive capabilities, and has the potential to enhance motion planning, trajectory prediction and traffic simulation. This article proposes a cloud-edge-terminal collaborative architecture to support AIG…

    Submitted 2 July, 2024; originally announced July 2024.

  49. arXiv:2407.00995  [pdf, other

    cs.CY eess.SY physics.app-ph

    Data on the Move: Traffic-Oriented Data Trading Platform Powered by AI Agent with Common Sense

    Authors: Yi Yu, Shengyue Yao, Tianchen Zhou, Yexuan Fu, Jingru Yu, Ding Wang, Xuhong Wang, Cen Chen, Yilun Lin

    Abstract: In the digital era, data has become a pivotal asset, advancing technologies such as autonomous driving. Despite this, data trading faces challenges like the absence of robust pricing methods and the lack of trustworthy trading mechanisms. To address these challenges, we introduce a traffic-oriented data trading platform named Data on The Move (DTM), integrating traffic simulation, data trading, an…

    Submitted 1 July, 2024; originally announced July 2024.

  50. arXiv:2406.08761  [pdf, other

    cs.SD eess.AS

    VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

    Authors: Yifeng Yu, Jiatong Shi, Yuning Wu, Yuxun Tang, Shinji Watanabe

    Abstract: Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pr…

    Submitted 13 December, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: 8 pages, 2 figures, SLT 2024