Showing 1–50 of 162 results for author: Ye, J

  1. arXiv:2506.12537 [pdf, ps, other]

    cs.CL cs.AI eess.AS

    Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

    Authors: Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

    Abstract: Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We…

    Submitted 14 June, 2025; originally announced June 2025.

  2. arXiv:2506.11160 [pdf, ps, other]

    eess.AS cs.SD

    S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning

    Authors: Yu Pan, Yuguang Yang, Yanni Hu, Jianhao Ye, Xiang Zhang, Hongbin Zhou, Lei Ma, Jianjun Zhao

    Abstract: Despite recent advances in multilingual speech-to-speech translation (S2ST), several critical challenges persist: 1) achieving high-quality translation remains a major hurdle, and 2) most existing methods heavily rely on large-scale parallel speech corpora, which are costly and difficult to obtain. To address these issues, we propose S2ST-Omni, an efficient and scalable framework for mult…

    Submitted 8 July, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: Work in progress

  3. arXiv:2505.19225 [pdf, ps, other]

    eess.IV cs.CV

    MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation

    Authors: Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning, Wei Li, Lihao Liu, Qiushan Guo, Tianbin Li, Junjun He, Hongming Shan

    Abstract: Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer -- one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this…

    Submitted 25 May, 2025; originally announced May 2025.

  4. arXiv:2505.17589 [pdf, ps, other]

    cs.SD cs.AI eess.AS

    CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    Authors: Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye

    Abstract: In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-…

    Submitted 27 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: Preprint, work in progress

  5. arXiv:2505.15860 [pdf]

    eess.IV

    RadarRGBD: A Multi-Sensor Fusion Dataset for Perception with RGB-D and mmWave Radar

    Authors: Tieshuai Song, Jiandong Ye, Ao Guo, Guidong He, Bin Yang

    Abstract: Multi-sensor fusion has significant potential in perception tasks for both indoor and outdoor environments. Especially under challenging conditions such as adverse weather and low-light environments, the combined use of millimeter-wave radar and RGB-D sensors has shown distinct advantages. However, existing multi-sensor datasets in the fields of autonomous driving and robotics often lack high-qual…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 6 pages, 7 figures. Contains a new RGBD dataset for depth completion. Code and dataset will be released

  6. arXiv:2505.13805 [pdf, ps, other]

    cs.SD cs.AI eess.AS

    ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

    Authors: Yu Pan, Yanni Hu, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

    Abstract: Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted by InterSpeech 2025

  7. arXiv:2505.12552 [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    FreqSelect: Frequency-Aware fMRI-to-Image Reconstruction

    Authors: Junliang Ye, Lei Wang, Md Zakir Hossain

    Abstract: Reconstructing natural images from functional magnetic resonance imaging (fMRI) data remains a core challenge in neural decoding due to the mismatch between the richness of visual stimuli and the noisy, low-resolution nature of fMRI signals. While recent two-stage models, combining deep variational autoencoders (VAEs) with diffusion models, have advanced this task, they treat all spatial-frequenc…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: Research report

  8. arXiv:2503.23149 [pdf, other]

    eess.IV

    Towards Interpretable Counterfactual Generation via Multimodal Autoregression

    Authors: Chenglong Ma, Yuanfeng Ji, Jin Ye, Lu Zhang, Ying Chen, Tianbin Li, Mingjie Li, Junjun He, Hongming Shan

    Abstract: Counterfactual medical image generation enables clinicians to explore clinical hypotheses, such as predicting disease progression, facilitating their decision-making. While existing methods can generate visually plausible images from disease progression prompts, they produce silent predictions that lack interpretation to verify how the generation reflects the hypothesized progression -- a critical…

    Submitted 29 March, 2025; originally announced March 2025.

  9. arXiv:2503.14928 [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Shushing! Let's Imagine an Authentic Speech from the Silent Video

    Authors: Jiaxin Ye, Hongming Shan

    Abstract: Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals, offering significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody fr…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: Project Page: https://imagintalk.github.io

  10. arXiv:2503.13139 [pdf, other]

    cs.CV cs.AI cs.CL eess.IV

    Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

    Authors: Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong

    Abstract: Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to "finding a needle in a haystack." To address thi…

    Submitted 17 May, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: 32 pages, under review

  11. arXiv:2503.10733 [pdf, other]

    cs.LG eess.SP

    TAU: Modeling Temporal Consistency Through Temporal Attentive U-Net for PPG Peak Detection

    Authors: Chunsheng Zuo, Yu Zhao, Juntao Ye

    Abstract: Photoplethysmography (PPG) sensors have been widely used in consumer wearable devices to monitor heart rates (HR) and heart rate variability (HRV). Despite the prevalence, PPG signals can be contaminated by motion artifacts induced from daily activities. Existing approaches mainly use the amplitude information to perform PPG peak detection. However, these approaches cannot accurately identify peak…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 27 pages, submitted to a journal

  12. arXiv:2502.19683 [pdf, other]

    eess.IV cs.CV

    Dual-branch Graph Feature Learning for NLOS Imaging

    Authors: Xiongfei Su, Tianyi Zhu, Lina Liu, Zheng Chen, Yulun Zhang, Siyuan Li, Juntian Ye, Feihu Xu, Xin Yuan

    Abstract: The domain of non-line-of-sight (NLOS) imaging is advancing rapidly, offering the capability to reveal occluded scenes that are not directly visible. However, contemporary NLOS systems face several significant challenges: (1) The computational and storage requirements are profound due to the inherent three-dimensional grid data structure, which restricts practical application. (2) The simultaneous…

    Submitted 26 February, 2025; originally announced February 2025.

  13. arXiv:2502.02950 [pdf, other]

    eess.AS cs.SD

    Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech

    Authors: Jixun Yao, Yuguang Yang, Yu Pan, Yuan Feng, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie

    Abstract: Integrating human feedback to align text-to-speech (TTS) system outputs with human preferences has proven to be an effective approach for enhancing the robustness of language model-based TTS systems. Current approaches primarily focus on using preference data annotated at the utterance level. However, frequent issues that affect the listening experience often only arise in specific segments of aud…

    Submitted 5 February, 2025; originally announced February 2025.

    Comments: WIP

  14. arXiv:2502.01046 [pdf, other]

    cs.SD cs.CV eess.AS

    Emotional Face-to-Speech

    Authors: Jiaxin Ye, Boyuan Cao, Hongming Shan

    Abstract: How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. I…

    Submitted 2 February, 2025; originally announced February 2025.

  15. arXiv:2501.15368 [pdf, other]

    cs.CL cs.SD eess.AS

    Baichuan-Omni-1.5 Technical Report

    Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, et al. (68 additional authors not shown)

    Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pip…

    Submitted 25 January, 2025; originally announced January 2025.

  16. arXiv:2501.13339 [pdf, ps, other]

    eess.SP

    Joint Beamforming and Position Optimization for Fluid RIS-aided ISAC Systems

    Authors: Junjie Ye, Peichang Zhang, Xiao-Peng Li, Lei Huang, Yuanwei Liu

    Abstract: A fluid reconfigurable intelligent surface (fRIS)-aided integrated sensing and communications (ISAC) system is proposed to enhance multi-target sensing and multi-user communication. Unlike the conventional RIS, the fRIS incorporates movable elements whose positions can be flexibly adjusted to provide extra spatial degrees of freedom. In this system, a joint optimization problem is formulated to mi…

    Submitted 24 January, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

    Comments: 13 pages, 10 figures, submitted to an IEEE journal for possible publication

  17. arXiv:2501.09400 [pdf, ps, other]

    cs.IT eess.SP

    Joint Antenna Selection and Beamforming Design for Active RIS-aided ISAC Systems

    Authors: Wei Ma, Peichang Zhang, Junjie Ye, Rouyang Guan, Xiao-Peng Li, Lei Huang

    Abstract: Active reconfigurable intelligent surface (A-RIS) aided integrated sensing and communications (ISAC) system has been considered as a promising paradigm to improve spectrum efficiency. However, massive energy-hungry radio frequency (RF) chains hinder its large-scale deployment. To address this issue, an A-RIS-aided ISAC system with antenna selection (AS) is proposed in this work, where a target is…

    Submitted 16 January, 2025; originally announced January 2025.

  18. arXiv:2412.19967 [pdf, other]

    cs.LG cs.AI eess.SP

    MobileNetV2: A lightweight classification model for home-based sleep apnea screening

    Authors: Hui Pan, Yanxuan Yu, Jilun Ye, Xu Zhang

    Abstract: This study proposes a novel lightweight neural network model leveraging features extracted from electrocardiogram (ECG) and respiratory signals for early OSA screening. ECG signals are used to generate feature spectrograms to predict sleep stages, while respiratory signals are employed to detect sleep-related breathing abnormalities. By integrating these predictions, the method calculates the apne…

    Submitted 3 January, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

  19. arXiv:2412.13558 [pdf, other]

    eess.IV cs.CL cs.CV cs.LG

    Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation

    Authors: Changsun Lee, Sangjoon Park, Cheong-Il Shin, Woo Hee Choi, Hyun Jeong Park, Jeong Eun Lee, Jong Chul Ye

    Abstract: Recent medical vision-language models (VLMs) have shown promise in 2D medical image interpretation. However, extending them to 3D medical imaging has been challenging due to computational complexities and data scarcity. Although a few recent VLMs specified for 3D medical imaging have emerged, all are limited to learning volumetric representation of a 3D medical image as a set of sub-volumetric feat…

    Submitted 18 December, 2024; originally announced December 2024.

  20. arXiv:2412.04724 [pdf, other]

    eess.AS cs.SD

    StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching

    Authors: Jixun Yao, Yuguang Yang, Yu Pan, Ziqian Ning, Jiaohao Ye, Hongbin Zhou, Lei Xie

    Abstract: Zero-shot voice conversion (VC) aims to transfer the timbre from the source speaker to an arbitrary unseen speaker while preserving the original linguistic content. Despite recent advancements in zero-shot VC using language model-based or diffusion-based approaches, several challenges remain: 1) current approaches primarily focus on adapting timbre from unseen speakers and are unable to transfer s…

    Submitted 10 December, 2024; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  21. arXiv:2411.15540 [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Optical-Flow Guided Prompt Optimization for Coherent Video Generation

    Authors: Hyelin Nam, Jaemin Kim, Dohun Lee, Jong Chul Ye

    Abstract: While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences.…

    Submitted 23 March, 2025; v1 submitted 23 November, 2024; originally announced November 2024.

    Comments: CVPR 2025 (poster); project page: https://motionprompt.github.io/

  22. arXiv:2411.15490 [pdf, other]

    cs.CV cs.LG eess.IV

    Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation

    Authors: Junhyeok Lee, Yujin Oh, Dahyoun Lee, Hyon Keun Joh, Chul-Ho Sohn, Sung Hyun Baik, Cheol Kyu Jung, Jung Hyun Park, Kyu Sung Choi, Byung-Hoon Kim, Jong Chul Ye

    Abstract: Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contai…

    Submitted 23 November, 2024; originally announced November 2024.

  23. arXiv:2411.14525 [pdf, other]

    eess.IV cs.CV

    SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image Segmentation

    Authors: Jin Ye, Ying Chen, Yanjun Li, Haoyu Wang, Zhongying Deng, Ziyan Huang, Yanzhou Su, Chenglong Ma, Yuanfeng Ji, Junjun He

    Abstract: Computed Tomography (CT) is one of the most popular modalities for medical imaging. By far, CT images have contributed to the largest publicly available datasets for volumetric medical segmentation tasks, covering full-body anatomical structures. Large amounts of full-body CT images provide the opportunity to pre-train powerful models, e.g., STU-Net pre-trained in a supervised fashion, to segment…

    Submitted 21 November, 2024; originally announced November 2024.

  24. arXiv:2411.02026 [pdf, other]

    cs.SD cs.AI eess.AS

    CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching

    Authors: Yu Pan, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

    Abstract: Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into any previously unseen target speaker, while preserving the original linguistic content. Despite notable progress, attaining a degree of speaker similarity and naturalness on par with ground truth recordings continues to pose a great challenge. In this paper, we propose CTEFM-VC, a zero-shot VC framework that levera…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: Work in progress; 5 pages

  25. arXiv:2410.01350 [pdf, other]

    cs.SD cs.AI eess.AS

    Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling

    Authors: Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao

    Abstract: Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress in zero-shot VC, there remains considerable potential for improvements in speaker similarity and speech naturalness. Moreover, existing zero-shot VC systems str…

    Submitted 10 January, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

    Comments: Work in Progress; Under Review

  26. arXiv:2410.00046 [pdf, other]

    eess.IV cs.CV cs.LG

    Mixture of Multicenter Experts in Multimodal Generative AI for Advanced Radiotherapy Target Delineation

    Authors: Yujin Oh, Sangjoon Park, Xiang Li, Wang Yi, Jonathan Paly, Jason Efstathiou, Annie Chan, Jun Won Kim, Hwa Kyung Byun, Ik Jae Lee, Jaeho Cho, Chan Woo Wee, Peng Shu, Peilong Wang, Nathan Yu, Jason Holmes, Jong Chul Ye, Quanzheng Li, Wei Liu, Woong Sub Koom, Jin Sung Kim, Kyungsang Kim

    Abstract: Clinical experts employ diverse philosophies and strategies in patient care, influenced by regional patient populations. However, existing medical artificial intelligence (AI) models are often trained on data distributions that disproportionately reflect highly prevalent patterns, reinforcing biases and overlooking the diverse expertise of clinicians. To overcome this limitation, we introduce the…

    Submitted 26 October, 2024; v1 submitted 27 September, 2024; originally announced October 2024.

    Comments: 39 pages

  27. arXiv:2409.12377 [pdf, other]

    eess.IV cs.CV

    Fundus image enhancement through direct diffusion bridges

    Authors: Sehui Kim, Hyungjin Chung, Se Hie Park, Eui-Sang Chung, Kayoung Yi, Jong Chul Ye

    Abstract: We propose FD3, a fundus image enhancement method based on direct diffusion bridges, which can cope with a wide range of complex degradations, including haze, blur, noise, and shadow. We first propose a synthetic forward model through a human feedback loop with board-certified ophthalmologists for maximal quality improvement of low-quality in-vivo images. Using the proposed forward model, we train…

    Submitted 18 September, 2024; originally announced September 2024.

    Comments: Published at IEEE JBHI. 12 pages, 10 figures. Code and Data: https://github.com/heeheee888/FD3

  28. arXiv:2409.12139 [pdf, other]

    cs.SD cs.AI eess.AS

    Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

    Authors: Sijing Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Yu Pan, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jixun Yao, Quanlei Yan, Yuguang Yang, Jianhao Ye, Jingjing Yin, Yanzhen Yu, Huimin Zhang, Xiang Zhang, Guangcheng Zhao, Hongbin Zhou, Pengpeng Zou

    Abstract: With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-…

    Submitted 23 September, 2024; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: Technical Report; 18 pages; typos corrected, references added, demo url modified, author name modified

  29. arXiv:2408.03361 [pdf, other]

    eess.IV cs.CV

    GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

    Authors: Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao

    Abstract: Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Curren…

    Submitted 21 October, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

    Comments: GitHub: https://github.com/uni-medical/GMAI-MMBench Hugging face: https://huggingface.co/datasets/OpenGVLab/GMAI-MMBench

  30. arXiv:2406.07162 [pdf, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

    Authors: Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain

    Abstract: Speech emotion recognition (SER) is an important part of human-computer interaction, receiving extensive attention from both industry and academia. However, the current research field of SER has long suffered from the following problems: 1) There are few reasonable and universal splits of the datasets, making comparing different models and methods difficult. 2) No commonly used benchmark covers nu…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024. GitHub Repository: https://github.com/emo-box/EmoBox

  31. arXiv:2405.16011 [pdf, ps, other]

    eess.SP

    Semantic Importance-Aware Communications with Semantic Correction Using Large Language Models

    Authors: Shuaishuai Guo, Yanhu Wang, Jia Ye, Anbang Zhang, Kun Xu

    Abstract: Semantic communications, a promising approach for agent-human and agent-agent interactions, typically operate at a feature level, lacking true semantic understanding. This paper explores understanding-level semantic communications (ULSC), transforming visual data into human-intelligible semantic content. We employ an image caption neural network (ICNN) to derive semantic representations from visua…

    Submitted 24 May, 2024; originally announced May 2024.

  32. arXiv:2404.13605 [pdf, other]

    cs.CV eess.IV

    Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence

    Authors: Ripon Kumar Saha, Dehao Qin, Nianyi Li, Jinwei Ye, Suren Jayasuriya

    Abstract: Tackling image degradation due to atmospheric turbulence, particularly in dynamic environments, remains a challenge for long-range imaging systems. Existing techniques have been primarily designed for static scenes or scenes with small motion. This paper presents the first segment-then-restore pipeline for restoring the videos of dynamic scenes in turbulent environments. We leverage mean optical flo…

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Paper

  33. arXiv:2403.17324 [pdf, ps, other]

    eess.SP

    Unsupervised Learning for Joint Beamforming Design in RIS-aided ISAC Systems

    Authors: Junjie Ye, Lei Huang, Zhen Chen, Peichang Zhang, Mohamed Rihan

    Abstract: It is critical to design efficient beamforming in reconfigurable intelligent surface (RIS)-aided integrated sensing and communication (ISAC) systems for enhancing spectrum utilization. However, conventional methods often have limitations, either incurring high computational complexity due to iterative algorithms or sacrificing performance when using heuristic methods. To achieve both low complexit…

    Submitted 15 May, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted by IEEE Wireless Communications Letters

  34. Multi-Center Fetal Brain Tissue Annotation (FeTA) Challenge 2022 Results

    Authors: Kelly Payette, Céline Steger, Roxane Licandro, Priscille de Dumast, Hongwei Bran Li, Matthew Barkovich, Liu Li, Maik Dannecker, Chen Chen, Cheng Ouyang, Niccolò McConnell, Alina Miron, Yongmin Li, Alena Uus, Irina Grigorescu, Paula Ramirez Gilliland, Md Mahfuzur Rahman Siddiquee, Daguang Xu, Andriy Myronenko, Haoyu Wang, Ziyan Huang, Jin Ye, Mireia Alenyà, Valentin Comte, Oscar Camara, et al. (42 additional authors not shown)

    Abstract: Segmentation is a critical step in analyzing the developing human fetal brain. There have been vast improvements in automatic segmentation methods in the past several years, and the Fetal Brain Tissue Annotation (FeTA) Challenge 2021 helped to establish an excellent standard of fetal brain segmentation. However, FeTA 2021 was a single center study, and the generalizability of algorithms across dif…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Results from FeTA Challenge 2022, held at MICCAI; Manuscript submitted to IEEE Transactions on Medical Imaging (2024). Supplementary Info (including submission methods descriptions) available here: https://zenodo.org/records/10628648

  35. arXiv:2402.02327 [pdf, other]

    cs.CV cs.SD eess.AS

    Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

    Authors: Tianxiang Chen, Zhentao Tan, Tao Gong, Qi Chu, Yue Wu, Bin Liu, Le Lu, Jieping Ye, Nenghai Yu

    Abstract: How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate…

    Submitted 6 February, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

  36. arXiv:2401.03425 [pdf, other]

    eess.SY math.ST

    Uncertainty Propagation and Bayesian Fusion on Unimodular Lie Groups from a Parametric Perspective

    Authors: Jikai Ye, Gregory S. Chirikjian

    Abstract: We address the problem of uncertainty propagation and Bayesian fusion on unimodular Lie groups. Starting from a stochastic differential equation (SDE) defined on Lie groups via McKean-Gangolli injection, we first convert it to a parametric SDE in exponential coordinates. The coefficient transform method for the conversion is stated for both Ito's and Stratonovich's interpretation of the SDE. Then…

    Submitted 7 March, 2025; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Accepted by CDC 2024; modified typos in theorem 2 and appendix A

  37. arXiv:2312.15185 [pdf, other]

    cs.CL cs.HC cs.MM cs.SD eess.AS

    emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

    Authors: Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen

    Abstract: We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining utterance-level loss and frame-level loss during pre-training. emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models by only training linear layers for the speec…

    Submitted 23 December, 2023; originally announced December 2023.

    Comments: Code, checkpoints, and extracted features are available at https://github.com/ddlBoJack/emotion2vec

  38. arXiv:2312.09576 [pdf, other]

    eess.IV cs.CV

    SegRap2023: A Benchmark of Organs-at-Risk and Gross Tumor Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma

    Authors: Xiangde Luo, Jia Fu, Yunxin Zhong, Shuolin Liu, Bing Han, Mehdi Astaraki, Simone Bendazzoli, Iuliana Toma-Dasu, Yiwen Ye, Ziyang Chen, Yong Xia, Yanzhou Su, Jin Ye, Junjun He, Zhaohu Xing, Hongqiu Wang, Lei Zhu, Kaixiang Yang, Xin Fang, Zhiwei Wang, Chan Woong Lee, Sang Joon Park, Jaehee Chun, Constantin Ulrich, Klaus H. Maier-Hein, et al. (17 additional authors not shown)

    Abstract: Radiation therapy is a primary and effective NasoPharyngeal Carcinoma (NPC) treatment strategy. The precise delineation of Gross Tumor Volumes (GTVs) and Organs-At-Risk (OARs) is crucial in radiation treatment, directly impacting patient prognosis. Previously, the delineation of GTVs and OARs was performed by experienced radiation oncologists. Recently, deep learning has achieved promising results…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: A challenge report of SegRap2023 (organized in conjunction with MICCAI2023)

  39. arXiv:2312.03348  [pdf, other]

    eess.SY cs.RO

    Uncertainty Propagation on Unimodular Matrix Lie Groups

    Authors: Jikai Ye, Amitesh S. Jayaraman, Gregory S. Chirikjian

    Abstract: This paper addresses uncertainty propagation on unimodular matrix Lie groups that have a surjective exponential map. We derive the exact formula for the propagation of mean and covariance in a continuous-time setting from the governing Fokker-Planck equation. Two approximate propagation methods are discussed based on the exact formula. One uses numerical quadrature and another utilizes the expansi…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: 37 pages

  40. arXiv:2312.03013  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Breast Ultrasound Report Generation using LangChain

    Authors: Jaeyoung Huh, Hyun Jeong Park, Jong Chul Ye

    Abstract: Breast ultrasound (BUS) is a critical diagnostic tool in the field of breast imaging, aiding in the early detection and characterization of breast abnormalities. Interpreting breast ultrasound images commonly involves creating comprehensive medical reports, containing vital information to promptly assess the patient's condition. However, the ultrasound imaging system necessitates capturing multipl…

    Submitted 4 December, 2023; originally announced December 2023.

  41. Energy Efficiency Optimization in Active Reconfigurable Intelligent Surface-Aided Integrated Sensing and Communication Systems

    Authors: Junjie Ye, Mohamed Rihan, Peichang Zhang, Lei Huang, Stefano Buzzi, Zhen Chen

    Abstract: Energy efficiency (EE) is a challenging task in integrated sensing and communication (ISAC) systems, where high spectral efficiency and low energy consumption appear as conflicting requirements. Although passive reconfigurable intelligent surface (RIS) has emerged as a promising technology for enhancing the EE of the ISAC system, the multiplicative fading feature hinders its effectiveness. This pa…

    Submitted 19 September, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted by IEEE TVT

  42. arXiv:2311.11969  [pdf, other]

    eess.IV cs.CV

    SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks

    Authors: Jin Ye, Junlong Cheng, Jianpin Chen, Zhongying Deng, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Min Zhu, Shaoting Zhang, Junjun He, Yu Qiao

    Abstract: Segment Anything Model (SAM) has achieved impressive results for natural image segmentation with input prompts such as points and bounding boxes. Its success largely owes to massive labeled training data. However, directly applying SAM to medical image segmentation cannot perform well because SAM lacks medical knowledge -- it does not use medical images for training. To incorporate medical knowled…

    Submitted 20 November, 2023; originally announced November 2023.

  43. arXiv:2311.07033  [pdf, other]

    eess.IV cs.CV

    TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival Prediction

    Authors: Ruiquan Ge, Xiangyang Hu, Rungen Huang, Gangyong Jia, Yaqi Wang, Renshu Gu, Changmiao Wang, Elazab Ahmed, Linyan Wang, Juan Ye, Ye Li

    Abstract: Survival prediction plays a crucial role in assisting clinicians with the development of cancer treatment protocols. Recent evidence shows that multimodal data can help in the diagnosis of cancer disease and improve survival prediction. Currently, deep learning-based approaches have experienced increasing success in survival prediction by integrating pathological images and gene expression data. H…

    Submitted 12 November, 2023; originally announced November 2023.

  44. LLM-driven Multimodal Target Volume Contouring in Radiation Oncology

    Authors: Yujin Oh, Sangjoon Park, Hwa Kyung Byun, Yeona Cho, Ik Jae Lee, Jin Sung Kim, Jong Chul Ye

    Abstract: Target volume contouring for radiation therapy is considered significantly more challenging than the normal organ segmentation tasks as it necessitates the utilization of both image and text-based clinical information. Inspired by the recent advancement of large language models (LLMs) that can facilitate the integration of the textual information and images, here we present a novel LLM-driven mul…

    Submitted 24 October, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

    Comments: Published in Nature Communications, see https://www.nature.com/articles/s41467-024-53387-y

    Journal ref: Nat Commun 15, 9186 (2024)

  45. arXiv:2311.00483  [pdf, other]

    eess.IV cs.CV

    DEFN: Dual-Encoder Fourier Group Harmonics Network for Three-Dimensional Indistinct-Boundary Object Segmentation

    Authors: Xiaohua Jiang, Yihao Guo, Jian Huang, Yuting Wu, Meiyi Luo, Zhaoyang Xu, Qianni Zhang, Xingru Huang, Hong He, Shaowei Jiang, Jing Ye, Mang Xiao

    Abstract: The precise spatial and quantitative delineation of indistinct-boundary medical objects is paramount for the accuracy of diagnostic protocols, efficacy of surgical interventions, and reliability of postoperative assessments. Despite their significance, the effective segmentation and instantaneous three-dimensional reconstruction are significantly impeded by the paucity of representative samples in…

    Submitted 19 June, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: 36 pages, 16 figures, 7 tables

    MSC Class: 68; 92 ACM Class: I.4; J.3

  46. arXiv:2309.12688  [pdf, ps, other]

    cs.IT eess.SP

    Green Holographic MIMO Communications With A Few Transmit Radio Frequency Chains

    Authors: Shuaishuai Guo, Jia Ye, Kaiqian Qu, Shuping Dang

    Abstract: Holographic multiple-input multiple-output (MIMO) communications are widely recognized as a promising candidate for the next-generation air interface. With holographic MIMO surface, the number of the spatial degrees-of-freedom (DoFs) considerably increases and also significantly varies as the user moves. To fully employ the large and varying number of spatial DoFs, the number of equipped RF chains…

    Submitted 22 September, 2023; originally announced September 2023.

    Comments: 10 figures; has been accepted by TGCN

  47. A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

    Authors: Ziyan Huang, Zhongying Deng, Jin Ye, Haoyu Wang, Yanzhou Su, Tianbin Li, Hui Sun, Junlong Cheng, Jianpin Chen, Junjun He, Yun Gu, Shaoting Zhang, Lixu Gu, Yu Qiao

    Abstract: Although deep learning has revolutionized abdominal multi-organ segmentation, models often struggle with generalization due to training on small, specific datasets. With the recent emergence of large-scale datasets, some important questions arise: Can models trained on these datasets generalize well on different ones? If yes/no, how to further improve their generalizability? To address t…

    Submitted 7 September, 2023; originally announced September 2023.

  48. arXiv:2309.03112  [pdf, other]

    eess.SY

    A Lie-Theoretic Approach to Propagating Uncertainty Jointly in Attitude and Angular Momentum

    Authors: Amitesh S. Jayaraman, Jikai Ye, Gregory S. Chirikjian

    Abstract: Dynamic state estimation, as opposed to kinematic state estimation, seeks to estimate not only the orientation of a rigid body but also its angular velocity, through Euler's equations of rotational motion. This paper demonstrates that the dynamic state estimation problem can be reformulated as estimating a probability distribution on a Lie group defined on phase space (the product space of rotatio…

    Submitted 6 September, 2023; originally announced September 2023.

    Comments: 8 pages, 4 figures

  49. arXiv:2308.12859  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ME

    Towards Automated Animal Density Estimation with Acoustic Spatial Capture-Recapture

    Authors: Yuheng Wang, Juan Ye, David L. Borchers

    Abstract: Passive acoustic monitoring can be an effective way of monitoring wildlife populations that are acoustically active but difficult to survey visually. Digital recorders allow surveyors to gather large volumes of data at low cost, but identifying target species vocalisations in these data is non-trivial. Machine learning (ML) methods are often used to do the identification. They can process large vo…

    Submitted 24 August, 2023; originally announced August 2023.

    Comments: 35 pages, 5 figures

  50. arXiv:2308.06533  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Knowledge Distilled Ensemble Model for sEMG-based Silent Speech Interface

    Authors: Wenqiang Lai, Qihan Yang, Ye Mao, Endong Sun, Jiangnan Ye

    Abstract: Voice disorders affect millions of people worldwide. Surface electromyography-based Silent Speech Interfaces (sEMG-based SSIs) have been explored as a potential solution for decades. However, previous works were limited by small vocabularies and manually extracted features from raw data. To address these limitations, we propose a lightweight deep learning knowledge-distilled ensemble model for sEM…

    Submitted 6 August, 2023; originally announced August 2023.

    Comments: 6 pages, 5 figures