Skip to main content

Showing 1–50 of 168 results for author: Lü, L

Searching in archive eess. Search in all archives.
.
  1. arXiv:2507.03872  [pdf, ps, other

    eess.IV cs.CV

    PLUS: Plug-and-Play Enhanced Liver Lesion Diagnosis Model on Non-Contrast CT Scans

    Authors: Jiacheng Hao, Xiaoming Zhang, Wei Liu, Xiaoli Yin, Yuan Gao, Chunli Li, Ling Zhang, Le Lu, Yu Shi, Xu Han, Ke Yan

    Abstract: Focal liver lesions (FLL) are common clinical findings during physical examination. Early diagnosis and intervention of liver malignancies are crucial to improving patient survival. Although the current 3D segmentation paradigm can accurately detect lesions, it faces limitations in distinguishing between malignant and benign liver lesions, primarily due to its inability to differentiate subtle var… ▽ More

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: MICCAI 2025 (Early Accepted)

  2. arXiv:2506.20282  [pdf, ps, other

    eess.IV cs.CV

    Opportunistic Osteoporosis Diagnosis via Texture-Preserving Self-Supervision, Mixture of Experts and Multi-Task Integration

    Authors: Jiaxing Huang, Heng Guo, Le Lu, Fan Yang, Minfeng Xu, Ge Yang, Wei Luo

    Abstract: Osteoporosis, characterized by reduced bone mineral density (BMD) and compromised bone microstructure, increases fracture risk in aging populations. While dual-energy X-ray absorptiometry (DXA) is the clinical standard for BMD assessment, its limited accessibility hinders diagnosis in resource-limited regions. Opportunistic computed tomography (CT) analysis has emerged as a promising alternative f… ▽ More

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Accepted by MICCAI 2025

  3. arXiv:2505.19709  [pdf, ps, other

    cs.IT eess.SP

    Capacity-Optimized Pre-Equalizer Design for Visible Light Communication Systems

    Authors: Runxin Zhang, Yulin Shao, Jian Xiong, Lu Lu, Murat Uysal

    Abstract: Since commercial LEDs are primarily designed for illumination rather than data transmission, their modulation bandwidth is inherently limited to a few MHz. This becomes a major bottleneck in the implementation of visible light communication (VLC) systems necessiating the design of pre-equalizers. While state-of-the-art equalizer designs primarily focus on the data rate increasing through bandwidth… ▽ More

    Submitted 26 May, 2025; originally announced May 2025.

  4. arXiv:2505.05112  [pdf, ps, other

    eess.IV cs.CV

    MDAA-Diff: CT-Guided Multi-Dose Adaptive Attention Diffusion Model for PET Denoising

    Authors: Xiaolong Niu, Zanting Ye, Xu Han, Yanchao Huang, Hao Sun, Hubing Wu, Lijun Lu

    Abstract: Acquiring high-quality Positron Emission Tomography (PET) images requires administering high-dose radiotracers, which increases radiation exposure risks. Generating standard-dose PET (SPET) from low-dose PET (LPET) has become a potential solution. However, previous studies have primarily focused on single low-dose PET denoising, neglecting two critical factors: discrepancies in dose response cause… ▽ More

    Submitted 21 June, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

  5. arXiv:2504.14076  [pdf, other

    cs.SD cs.LG eess.AS

    Transformation of audio embeddings into interpretable, concept-based representations

    Authors: Alice Zhang, Edison Thomaz, Lie Lu

    Abstract: Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio representations. In this work, we explore the semantic interpretability of audio embeddings extracted from these neural networks by leveraging CLAP, a contrastive l… ▽ More

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: Accepted to International Joint Conference on Neural Networks (IJCNN) 2025

  6. arXiv:2504.07827  [pdf, other

    eess.IV cs.CV

    HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss

    Authors: Yi Huang, Ke Zhang, Wei Liu, Yuanyuan Wang, Vishal M. Patel, Le Lu, Xu Han, Dakai Jin, Ke Yan

    Abstract: Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular… ▽ More

    Submitted 10 April, 2025; originally announced April 2025.

  7. arXiv:2504.00493  [pdf, other

    eess.SY

    Perturbation-Based Pinning Control Strategy for Enhanced Synchronization in Complex Networks

    Authors: Ziang Mao, Tianlong Fan, Linyuan Lü

    Abstract: Synchronization is essential for the stability and coordinated operation of complex networked systems. Pinning control, which selectively controls a subset of nodes, provides a scalable solution to enhance network synchronizability. However, existing strategies face key limitations: heuristic centrality-based methods lack a direct connection to synchronization dynamics, while spectral approaches,… ▽ More

    Submitted 10 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

    Comments: This work has been submitted to the IEEE for possible publication

  8. arXiv:2503.20290  [pdf, ps, other

    eess.AS cs.AI cs.CL cs.SD

    QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions

    Authors: Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Yu Tsao, Junichi Yamagishi, Yuxuan Wang, Chao Zhang

    Abstract: This paper explores a novel perspective to speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduc… ▽ More

    Submitted 15 June, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: 22 pages, 10 figures

  9. arXiv:2503.15338  [pdf, other

    eess.AS cs.CL cs.SD

    Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context

    Authors: Junyi Ao, Dekun Chen, Xiaohai Tian, Wenjie Feng, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

    Abstract: Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio. However, most existing models primarily focus on analyzing input signals using text instructions, overlooking scenarios in which speech instructions and audio are mixed and serve as inputs to the model. To address these challenges, we introduce Solla, a… ▽ More

    Submitted 19 March, 2025; originally announced March 2025.

  10. arXiv:2503.12698  [pdf, other

    eess.IV cs.CV

    A Continual Learning-driven Model for Accurate and Generalizable Segmentation of Clinically Comprehensive and Fine-grained Whole-body Anatomies in CT

    Authors: Dazhou Guo, Zhanghexuan Ji, Yanzhou Su, Dandan Zheng, Heng Guo, Puyang Wang, Ke Yan, Yirui Wang, Qinji Yu, Zi Li, Minfeng Xu, Jianfeng Zhang, Haoshen Li, Jia Ge, Tsung-Ying Ho, Bing-Shen Huang, Tashan Ai, Kuaile Zhao, Na Shen, Qifeng Wang, Yun Bian, Tingyu Wu, Peng Du, Hua Zhang, Feng-Ming Kong , et al. (9 additional authors not shown)

    Abstract: Precision medicine in the quantitative management of chronic diseases and oncology would be greatly improved if the Computed Tomography (CT) scan of any patient could be segmented, parsed and analyzed in a precise and detailed way. However, there is no such fully annotated CT dataset with all anatomies delineated for training because of the exceptionally high manual cost, the need for specialized… ▽ More

    Submitted 16 March, 2025; originally announced March 2025.

  11. arXiv:2503.11262  [pdf, ps, other

    cs.CV eess.IV

    Dark Noise Diffusion: Noise Synthesis for Low-Light Image Denoising

    Authors: Liying Lu, Raphaël Achddou, Sabine Süsstrunk

    Abstract: Low-light photography produces images with low signal-to-noise ratios due to limited photons. In such conditions, common approximations like the Gaussian noise model fall short, and many denoising techniques fail to remove noise effectively. Although deep-learning methods perform well, they require large datasets of paired images that are impractical to acquire. As a remedy, synthesizing realistic… ▽ More

    Submitted 5 July, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

  12. arXiv:2503.03786  [pdf, ps, other

    q-bio.TO cs.CV eess.IV

    Self is the Best Learner: CT-free Ultra-Low-Dose PET Organ Segmentation via Collaborating Denoising and Segmentation Learning

    Authors: Zanting Ye, Xiaolong Niu, Xu Han, Xuanbin Wu, Wantong Lu, Yijun Lu, Hao Sun, Yanchao Huang, Hubing Wu, Lijun Lu

    Abstract: Organ segmentation in Positron Emission Tomography (PET) plays a vital role in cancer quantification. Low-dose PET (LDPET) provides a safer alternative by reducing radiation exposure. However, the inherent noise and blurred boundaries make organ segmentation more challenging. Additionally, existing PET organ segmentation methods rely on coregistered Computed Tomography (CT) annotations, overlookin… ▽ More

    Submitted 26 June, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

    Comments: This work has been accepted by MICCAI2025; 9 pages, 5 figures

  13. arXiv:2502.04230  [pdf, other

    cs.SD cs.AI cs.CR cs.LG eess.AS

    XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

    Authors: Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli

    Abstract: The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like Wa… ▽ More

    Submitted 7 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

    Comments: 24 pages, 10 figures

  14. arXiv:2502.02603  [pdf, other

    eess.AS cs.CL cs.SD

    SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation

    Authors: Chunyu Sun, Bingyu Liu, Zhichao Cui, Anbin Qi, Tian-hao Zhang, Dinghao Zhou, Lewei Lu

    Abstract: Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language models (LLMs) applications. However, when it comes to speech larage language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architec… ▽ More

    Submitted 26 January, 2025; originally announced February 2025.

  15. Collaborative Channel Access and Transmission for NR Sidelink and Wi-Fi Coexistence over Unlicensed Spectrum

    Authors: Zhuangzhuang Yan, Xinyu Gu, Zhenyu Liu, Liyang Lu

    Abstract: With the rapid development of various internet of things (IoT) applications, including industrial IoT (IIoT) and visual IoT (VIoT), the demand for direct device-to-device communication to support high data rates continues to grow. To address this demand, 5G-Advanced has introduced sidelink communication over the unlicensed spectrum (SL-U) to increase data rates. However, the primary challenge of S… ▽ More

    Submitted 14 February, 2025; v1 submitted 19 January, 2025; originally announced January 2025.

  16. arXiv:2501.12023  [pdf, other

    cs.LG cs.CV eess.IV

    Comparative Analysis of Pre-trained Deep Learning Models and DINOv2 for Cushing's Syndrome Diagnosis in Facial Analysis

    Authors: Hongjun Liu, Changwei Song, Jiaqi Qiang, Jianqiang Li, Hui Pan, Lin Lu, Xiao Long, Qing Zhao, Jiuzuo Huang, Shi Chen

    Abstract: Cushing's syndrome is a condition caused by excessive glucocorticoid secretion from the adrenal cortex, often manifesting with moon facies and plethora, making facial data crucial for diagnosis. Previous studies have used pre-trained convolutional neural networks (CNNs) for diagnosing Cushing's syndrome using frontal facial images. However, CNNs are better at capturing local features, while Cushin… ▽ More

    Submitted 21 January, 2025; originally announced January 2025.

  17. arXiv:2501.11854  [pdf

    eess.IV cs.CV

    WaveNet-SF: A Hybrid Network for Retinal Disease Detection Based on Wavelet Transform in the Spatial-Frequency Domain

    Authors: Jilan Cheng, Guoli Long, Zeyu Zhang, Zhenjia Qi, Hanyu Wang, Libin Lu, Shuihua Wang, Yudong Zhang, Jin Hong

    Abstract: Retinal diseases are a leading cause of vision impairment and blindness, with timely diagnosis being critical for effective treatment. Optical Coherence Tomography (OCT) has become a standard imaging modality for retinal disease diagnosis, but OCT images often suffer from issues such as speckle noise, complex lesion shapes, and varying lesion sizes, making interpretation challenging. In this paper… ▽ More

    Submitted 20 January, 2025; originally announced January 2025.

  18. arXiv:2501.08566  [pdf, other

    cs.SD cs.AI eess.AS

    Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement

    Authors: Qianniu Chen, Xiaoyang Hao, Bowen Li, Yue Liu, Li Lu

    Abstract: Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized voice customization through voice cloning. However, current methods for achieving zero-shot TTS heavily rely on large model scales and extensive training datasets to ensure satisfactory performance and generalizability across various speakers. This raises concerns regarding both deployment costs and data security. In thi… ▽ More

    Submitted 14 January, 2025; originally announced January 2025.

    Comments: 5 pages,4 figures

  19. arXiv:2501.02730  [pdf, other

    eess.SP

    New Paradigm for Unified Near-Field and Far-Field Wireless Communications

    Authors: Zhaocheng Wang, Haochen Wu, Yuanbin Chen, Liyang Lu

    Abstract: Current Type I and Type II codebooks in fifth generation (5G) wireless communications are limited in supporting the coexistence of far-field and near-field user equipments, as they are exclusively designed for far-field scenarios. To fill this knowledge gap and encourage relevant proposals by the 3rd Generation Partnership Project (3GPP), this article provides a novel codebook to facilitate a unif… ▽ More

    Submitted 21 April, 2025; v1 submitted 5 January, 2025; originally announced January 2025.

    Comments: This paper has been accepted by IEEE Network magazine

  20. arXiv:2412.05831  [pdf, other

    cs.MM cs.SD eess.AS

    Semi-Supervised Contrastive Learning for Controllable Video-to-Music Retrieval

    Authors: Shanti Stewart, Gouthaman KV, Lie Lu, Andrea Fanelli

    Abstract: Content creators often use music to enhance their videos, from soundtracks in movies to background music in video blogs and social media content. However, identifying the best music for a video can be a difficult and time-consuming task. To address this challenge, we propose a novel framework for automatically retrieving a matching music clip for a given video, and vice versa. Our approach leverag… ▽ More

    Submitted 22 December, 2024; v1 submitted 8 December, 2024; originally announced December 2024.

    Comments: Accepted at ICASSP 2025

  21. arXiv:2411.18290  [pdf, ps, other

    eess.IV cs.CV

    Leveraging Semantic Asymmetry for Precise Gross Tumor Volume Segmentation of Nasopharyngeal Carcinoma in Planning CT

    Authors: Zi Li, Ying Chen, Zeli Chen, Yanzhou Su, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Yunhai Bai, Zhinlin Zheng, Le Lu, Yirui Wang, Jia Ge, Xianghua Ye, Senxiang Yan, Dakai Jin

    Abstract: In the radiation therapy of nasopharyngeal carcinoma (NPC), clinicians typically delineate the gross tumor volume (GTV) using non-contrast planning computed tomography to ensure accurate radiation dose delivery. However, the low contrast between tumors and adjacent normal tissues necessitates that radiation oncologists manually delineate the tumors, often relying on diagnostic MRI for guidance. %… ▽ More

    Submitted 26 June, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

  22. arXiv:2411.18138  [pdf, other

    eess.AS cs.AI cs.CL cs.SD

    SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

    Authors: Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

    Abstract: Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as… ▽ More

    Submitted 27 November, 2024; originally announced November 2024.

    Comments: Technical report

  23. arXiv:2411.05361  [pdf, ps, other

    cs.CL eess.AS

    Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

    Authors: Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang , et al. (55 additional authors not shown)

    Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati… ▽ More

    Submitted 9 June, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: ICLR 2025

  24. arXiv:2411.00726  [pdf, other

    eess.IV cs.AI cs.CV

    Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract

    Authors: Fan Xiao, Junlin Hou, Ruiwei Zhao, Rui Feng, Haidong Zou, Lina Lu, Yi Xu, Juzhao Zhang

    Abstract: Diabetic retinopathy (DR) is a leading cause of blindness worldwide and a common complication of diabetes. As two different imaging tools for DR grading, color fundus photography (CFP) and infrared fundus photography (IFP) are highly-correlated and complementary in clinical applications. To the best of our knowledge, this is the first study that explores a novel multi-modal deep learning framework… ▽ More

    Submitted 1 November, 2024; originally announced November 2024.

    Comments: 10 pages, 4 figures

  25. arXiv:2410.18610  [pdf, other

    eess.IV cs.CV

    A Joint Representation Using Continuous and Discrete Features for Cardiovascular Diseases Risk Prediction on Chest CT Scans

    Authors: Minfeng Xu, Chen-Chen Fan, Yan-Jie Zhou, Wenchao Guo, Pan Liu, Jing Qi, Le Lu, Hanqing Chao, Kunlun He

    Abstract: Cardiovascular diseases (CVD) remain a leading health concern and contribute significantly to global mortality rates. While clinical advancements have led to a decline in CVD mortality, accurately identifying individuals who could benefit from preventive interventions remains an unsolved challenge in preventive cardiology. Current CVD risk prediction models, recommended by guidelines, are based on… ▽ More

    Submitted 15 November, 2024; v1 submitted 24 October, 2024; originally announced October 2024.

    Comments: 23 pages, 9 figures

  26. arXiv:2409.16644  [pdf, other

    eess.AS cs.CL cs.SD

    Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

    Authors: Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

    Abstract: Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM) \etc., which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific… ▽ More

    Submitted 1 April, 2025; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: Accepted by ICASSP 2025

  27. arXiv:2409.08680  [pdf, other

    eess.AS cs.AI cs.CL

    NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training

    Authors: Minglun Han, Ye Bai, Chen Shen, Youjia Huang, Mingkun Huang, Zehua Lin, Linhao Dong, Lu Lu, Yuxuan Wang

    Abstract: Speech self-supervised pre-training can effectively improve the performance of downstream tasks. However, previous self-supervised learning (SSL) methods for speech, such as HuBERT and BEST-RQ, focus on utilizing non-causal encoders with bidirectional context, and lack sufficient support for downstream streaming models. To address this issue, we introduce the next token prediction based speech pre… ▽ More

    Submitted 13 September, 2024; originally announced September 2024.

    Comments: 5 pages, 2 figures, Work in progress

  28. arXiv:2409.03804  [pdf, other

    eess.IV

    End-to-end Multi-source Visual Prompt Tuning for Survival Analysis in Whole Slide Images

    Authors: Zhongwei Qiu, Hanqing Chao, Wenbin Liu, Yixuan Shen, Le Lu, Ke Yan, Dakai Jin, Yun Bian, Hui Jiang

    Abstract: Survival analysis using pathology images poses a considerable challenge, as it requires the localization of relevant information from the multitude of tiles within whole slide images (WSIs). Current methods typically resort to a two-stage approach, where a pre-trained network extracts features from tiles, which are then used by survival models. This process, however, does not optimize the survival… ▽ More

    Submitted 5 September, 2024; originally announced September 2024.

  29. arXiv:2409.03087  [pdf, other

    eess.IV cs.CV

    Coupling AI and Citizen Science in Creation of Enhanced Training Dataset for Medical Image Segmentation

    Authors: Amir Syahmi, Xiangrong Lu, Yinxuan Li, Haoxuan Yao, Hanjun Jiang, Ishita Acharya, Shiyi Wang, Yang Nan, Xiaodan Xing, Guang Yang

    Abstract: Recent advancements in medical imaging and artificial intelligence (AI) have greatly enhanced diagnostic capabilities, but the development of effective deep learning (DL) models is still constrained by the lack of high-quality annotated datasets. The traditional manual annotation process by medical experts is time- and resource-intensive, limiting the scalability of these datasets. In this work, w… ▽ More

    Submitted 4 September, 2024; originally announced September 2024.

  30. Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos

    Authors: Dennis Fedorishin, Lie Lu, Srirangaraj Setlur, Venu Govindaraju

    Abstract: A "match cut" is a common video editing technique where a pair of shots that have a similar composition transition fluidly from one to another. Although match cuts are often visual, certain match cuts involve the fluid transition of audio, where sounds from different sources merge into one indistinguishable transition between two shots. In this paper, we explore the ability to automatically find a… ▽ More

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Accepted to ICASSP 2024

  31. arXiv:2408.01808  [pdf, other

    cs.CR cs.AI cs.SD eess.AS

    ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features

    Authors: Peng Cheng, Yuwei Wang, Peng Huang, Zhongjie Ba, Xiaodong Lin, Feng Lin, Li Lu, Kui Ren

    Abstract: Extensive research has revealed that adversarial examples (AE) pose a significant threat to voice-controllable smart devices. Recent studies have proposed black-box adversarial attacks that require only the final transcription from an automatic speech recognition (ASR) system. However, these attacks typically involve many queries to the ASR, resulting in substantial costs. Moreover, AE-based adver… ▽ More

    Submitted 3 August, 2024; originally announced August 2024.

    Comments: Published in the 2024 IEEE Symposium on Security and Privacy (SP)

  32. arXiv:2407.13210  [pdf, other

    eess.IV cs.CV

    Improved Esophageal Varices Assessment from Non-Contrast CT Scans

    Authors: Chunli Li, Xiaoming Zhang, Yuan Gao, Xiaoli Yin, Le Lu, Ling Zhang, Ke Yan, Yu Shi

    Abstract: Esophageal varices (EV), a serious health concern resulting from portal hypertension, are traditionally diagnosed through invasive endoscopic procedures. Despite non-contrast computed tomography (NC-CT) imaging being a less expensive and non-invasive imaging modality, it has yet to gain full acceptance as a primary clinical diagnostic tool for EV evaluation. To overcome existing diagnostic challen… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Early accepted to MICCAI 2024

  33. arXiv:2407.11529  [pdf, other

    eess.IV cs.AI cs.CV

    Cross-Phase Mutual Learning Framework for Pulmonary Embolism Identification on Non-Contrast CT Scans

    Authors: Bizhe Bai, Yan-Jie Zhou, Yujian Hu, Tony C. W. Mok, Yilang Xiang, Le Lu, Hongkun Zhang, Minfeng Xu

    Abstract: Pulmonary embolism (PE) is a life-threatening condition where rapid and accurate diagnosis is imperative yet difficult due to predominantly atypical symptomatology. Computed tomography pulmonary angiography (CTPA) is acknowledged as the gold standard imaging tool in clinics, yet it can be contraindicated for emergency department (ED) patients and represents an onerous procedure, thus necessitating… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Early accept by MICCAI 2024

  34. arXiv:2407.04675  [pdf, other

    eess.AS cs.SD

    Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

    Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li , et al. (30 additional authors not shown)

    Abstract: Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this wor… ▽ More

    Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  35. arXiv:2407.03992  [pdf, other

    eess.IV

    Medical Image Fusion for High-Level Analysis: A Mutual Enhancement Framework for Unaligned PAT and MRI

    Authors: Yutian Zhong, Jinchuan He, Zhichao Liang, Shuangyang Zhang, Qianjin Feng, Lijun Lu, Li Qi

    Abstract: Photoacoustic tomography (PAT) offers optical contrast, whereas magnetic resonance imaging (MRI) excels in imaging soft tissue and organ anatomy. The fusion of PAT with MRI holds promising application prospects due to their complementary advantages. Existing image fusion have made considerable progress in pre-registered images, yet spatial deformations are difficult to avoid in medical imaging sce… ▽ More

    Submitted 19 March, 2025; v1 submitted 4 July, 2024; originally announced July 2024.

  36. arXiv:2406.15222  [pdf

    eess.IV cs.AI cs.CV

    A Deep Learning System for Rapid and Accurate Warning of Acute Aortic Syndrome on Non-contrast CT in China

    Authors: Yujian Hu, Yilang Xiang, Yan-Jie Zhou, Yangyan He, Dehai Lang, Shifeng Yang, Xiaolong Du, Chunlan Den, Youyao Xu, Gaofeng Wang, Zhengyao Ding, Jingyong Huang, Wenjun Zhao, Xuejun Wu, Donglin Li, Qianqian Zhu, Zhenjiang Li, Chenyang Qiu, Ziheng Wu, Yunjun He, Chen Tian, Yihui Qiu, Zuodong Lin, Xiaolong Zhang, Yuan He , et al. (19 additional authors not shown)

    Abstract: The accurate and timely diagnosis of acute aortic syndromes (AAS) in patients presenting with acute chest pain remains a clinical challenge. Aortic CT angiography (CTA) is the imaging protocol of choice in patients with suspected AAS. However, due to economic and workflow constraints in China, the majority of suspected patients initially undergo non-contrast CT as the initial imaging testing, and… ▽ More

    Submitted 23 April, 2025; v1 submitted 13 June, 2024; originally announced June 2024.

  37. arXiv:2406.13340  [pdf, other

    cs.CL cs.SD eess.AS

    SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

    Authors: Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

    Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, includin… ▽ More

    Submitted 16 January, 2025; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted to NeurIPS 2024

  38. arXiv:2406.07914  [pdf, other

    cs.SD eess.AS

    Can Large Language Models Understand Spatial Audio?

    Authors: Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

    Abstract: This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study 3 spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and lo… ▽ More

    Submitted 14 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  39. arXiv:2406.07842  [pdf, other

    eess.AS cs.CL

    Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR

    Authors: Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

    Abstract: This paper addresses challenges in integrating new languages into a pre-trained multilingual automatic speech recognition (mASR) system, particularly in scenarios where training data for existing languages is limited or unavailable. The proposed method employs a dual-pipeline with low-rank adaptation (LoRA). It maintains two data flow pipelines-one for existing languages and another for new langua… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 4 tables

  40. arXiv:2406.02430  [pdf, other

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu , et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  41. arXiv:2405.20336  [pdf, other

    cs.CV cs.SD eess.AS

    RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text

    Authors: Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan

    Abstract: In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic bo… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Project website: https://vis-www.cs.umass.edu/RapVerse

  42. arXiv:2404.13372  [pdf, other

    eess.IV cs.CV

    HybridFlow: Infusing Continuity into Masked Codebook for Extreme Low-Bitrate Image Compression

    Authors: Lei Lu, Yanyue Xie, Wei Jiang, Wei Wang, Xue Lin, Yanzhi Wang

    Abstract: This paper investigates the challenging problem of learned image compression (LIC) with extreme low bitrates. Previous LIC methods based on transmitting quantized continuous features often yield blurry and noisy reconstruction due to the severe quantization loss. While previous LIC methods based on learned codebooks that discretize visual space usually give poor-fidelity reconstruction due to the… ▽ More

    Submitted 20 April, 2024; originally announced April 2024.

  43. arXiv:2404.04878  [pdf, other

    eess.IV cs.CV

    CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data

    Authors: Wei Fang, Yuxing Tang, Heng Guo, Mingze Yuan, Tony C. W. Mok, Ke Yan, Jiawen Yao, Xin Chen, Zaiyi Liu, Le Lu, Ling Zhang, Minfeng Xu

    Abstract: In the realm of medical 3D data, such as CT and MRI images, prevalent anisotropic resolution is characterized by high intra-slice but diminished inter-slice resolution. The lowered resolution between adjacent slices poses challenges, hindering optimal viewing experiences and impeding the development of robust downstream analysis algorithms. Various volumetric super-resolution algorithms aim to sur… ▽ More

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: CVPR accepted paper

  44. arXiv:2403.12620  [pdf, other

    eess.SP

    Near-Field Channel Estimation in Dual-Band XL-MIMO with Side Information-Assisted Compressed Sensing

    Authors: Haochen Wu, Liyang Lu, Zhaocheng Wang

    Abstract: Near-field communication comes to be an indispensable part of the future sixth generation (6G) communications at the arrival of the forth-coming deployment of extremely large-scale multiple-input-multiple-output (XL-MIMO) systems. Due to the huge array aperture and high-frequency bands, the electromagnetic radiation field is modeled by the spherical waves instead of the conventional planar waves,… ▽ More

    Submitted 2 September, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted by IEEE Transactions on Communications (IEEE TCOM). doi: 10.1109/TCOMM.2024.3445282

  45. arXiv:2403.12369  [pdf, ps, other

    eess.SP

    Near-Field Communications with Block-Dominant Compressed Sensing: Fundamentals, Approaches, and Future Directions

    Authors: Liyang Lu, Ke Ma, Yue Wang, Zhaocheng Wang

    Abstract: In the context of extremely large-scale antenna arrays deployed in sixth-generation (6G) mobile networks, near-field (NF) communications have gained considerable attention. Unlike the planar waves formulated in the far-field, electromagnetic radiation propagates as spherical waves in the NF. This alteration affects the NF channel characteristics, particularly resulting in weak sparsity in angular-… ▽ More

    Submitted 26 January, 2025; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: To appear in IEEE Communications Magazine

  46. arXiv:2403.02010  [pdf, other

    cs.SD eess.AS

    SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR

    Authors: Zhiyun Fan, Linhao Dong, Jun Zhang, Lu Lu, Zejun Ma

    Abstract: Multi-talker automatic speech recognition plays a crucial role in scenarios involving multi-party interactions, such as meetings and conversations. Due to its inherent complexity, this task has been receiving increasing attention. Notably, the serialized output training (SOT) stands out among various approaches because of its simplistic architecture and exceptional performance. However, the freque… ▽ More

    Submitted 4 March, 2024; originally announced March 2024.

  47. arXiv:2402.07485  [pdf, other

    cs.SD eess.AS

    MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning

    Authors: Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu, Zejun Ma

    Abstract: In the realm of audio-language pre-training (ALP), the challenge of achieving cross-modal alignment is significant. Moreover, the integration of audio inputs with diverse distributions and task variations poses challenges in developing generic audio-language models. In this study, we present MINT, a novel ALP framework boosting audio-language models through multi-target pre-training and instructio… ▽ More

    Submitted 11 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

  48. Block-Sparse Tensor Recovery

    Authors: Liyang Lu, Zhaocheng Wang, Zhen Gao, Sheng Chen, H. Vincent Poor

    Abstract: This work explores the fundamental problem of the recoverability of a sparse tensor being reconstructed from its compressed embodiment. We present a generalized model of block-sparse tensor recovery as a theoretical foundation, where concepts measuring holistic mutual incoherence property (MIP) of the measurement matrix set are defined. A representative algorithm based on the orthogonal matching p… ▽ More

    Submitted 24 August, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

    Comments: Accepted by IEEE Transactions on Information Theory

    Journal ref: IEEE Trans. Inf. Theory, vol. 70, no. 12, pp. 9293-9326, Dec. 2024

  49. arXiv:2402.02327  [pdf, other

    cs.CV cs.SD eess.AS

    Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

    Authors: Tianxiang Chen, Zhentao Tan, Tao Gong, Qi Chu, Yue Wu, Bin Liu, Le Lu, Jieping Ye, Nenghai Yu

    Abstract: How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate… ▽ More

    Submitted 6 February, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

  50. arXiv:2401.15704  [pdf, other

    cs.CR cs.SD eess.AS

    Phoneme-Based Proactive Anti-Eavesdropping with Controlled Recording Privilege

    Authors: Peng Huang, Yao Wei, Peng Cheng, Zhongjie Ba, Li Lu, Feng Lin, Yang Wang, Kui Ren

    Abstract: The widespread smart devices raise people's concerns of being eavesdropped on. To enhance voice privacy, recent studies exploit the nonlinearity in microphone to jam audio recorders with inaudible ultrasound. However, existing solutions solely rely on energetic masking. Their simple-form noise leads to several problems, such as high energy requirements and being easily removed by speech enhancemen… ▽ More

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: 14 pages, 28 figures; submitted to IEEE TDSC