
Showing 1–50 of 313 results for author: Liu, T

Searching in archive eess.
  1. arXiv:2506.23353  [pdf, ps, other]

    cs.CV eess.IV

    Layer Decomposition and Morphological Reconstruction for Task-Oriented Infrared Image Enhancement

    Authors: Siyuan Chai, Xiaodong Guo, Tong Liu

    Abstract: Infrared image helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared image often suffers from low contrast, especially in non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying…

    Submitted 29 June, 2025; originally announced June 2025.

  2. arXiv:2506.17540  [pdf, ps, other]

    eess.IV cs.CV cs.LG

    MTSIC: Multi-stage Transformer-based GAN for Spectral Infrared Image Colorization

    Authors: Tingting Liu, Yuan Liu, Jinhui Tang, Liyin Yuan, Chengyu Liu, Chunlai Li, Xiubao Sui, Qian Chen

    Abstract: Thermal infrared (TIR) images, acquired through thermal radiation imaging, are unaffected by variations in lighting conditions and atmospheric haze. However, TIR images inherently lack color and texture information, limiting downstream tasks and potentially causing visual fatigue. Existing colorization methods primarily rely on single-band images with limited spectral information and insufficient…

    Submitted 20 June, 2025; originally announced June 2025.

  3. arXiv:2506.09344  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.SD eess.AS

    Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, et al. (33 additional authors not shown)

    Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  4. arXiv:2505.13338  [pdf, ps, other]

    cs.CL cs.AI eess.AS

    Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation

    Authors: Qiongqiong Wang, Hardik B. Sailor, Tianchi Liu, Ai Ti Aw

    Abstract: Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data, that integrates contextual reasoning with paralinguistic information. It consists of a pseudo paralinguistic label-based da…

    Submitted 3 June, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025. [v2]: The dataset has been released, and the link is now updated

  5. arXiv:2505.08547  [pdf]

    eess.IV

    SAR-GTR: Attributed Scattering Information Guided SAR Graph Transformer Recognition Algorithm

    Authors: Xuying Xiong, Xinyu Zhang, Weidong Jiang, Li Liu, Yongxiang Liu, Tianpeng Liu

    Abstract: Utilizing electromagnetic scattering information for SAR data interpretation is currently a prominent research focus in the SAR interpretation domain. Graph Neural Networks (GNNs) can effectively integrate domain-specific physical knowledge and human prior knowledge, thereby alleviating challenges such as limited sample availability and poor generalization in SAR interpretation. In this study, we…

    Submitted 13 May, 2025; originally announced May 2025.

  6. arXiv:2505.07043  [pdf, ps, other]

    eess.SY

    Design and Experimental Test of Datatic Approximate Optimal Filter in Nonlinear Dynamic Systems

    Authors: Weixian He, Zeyu He, Wenhan Cao, Haoyu Gao, Tong Liu, Bin Shuai, Chang Liu, Shengbo Eben Li

    Abstract: Filtering is crucial in engineering fields, providing vital state estimation for control systems. However, the nonlinear nature of complex systems and the presence of non-Gaussian noises pose significant challenges to the performance of conventional filtering methods in terms of estimation accuracy and computational efficiency. In this work, we present a data-driven closed-loop filter, termed data…

    Submitted 11 May, 2025; originally announced May 2025.

  7. arXiv:2504.19555  [pdf, ps, other]

    eess.SP

    Physical-Layer Security in Mixed Near-Field and Far-Field Communication Systems

    Authors: Tianyu Liu, Changsheng You, Cong Zhou, Yunpu Zhang, Shiqi Gong, Heng Liu, Guangchi Zhang

    Abstract: Extremely large-scale arrays (XL-arrays) have emerged as a promising technology to improve the spectrum efficiency and spatial resolution of future wireless systems. Different from existing works that mostly considered physical layer security (PLS) in either the far-field or near-field, we consider in this paper a new and practical scenario, where legitimate users (Bobs) are located in the far-fie…

    Submitted 4 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

  8. arXiv:2504.19550  [pdf, ps, other]

    eess.SP

    Deployment Optimization for XL-IRS Assisted Multi-User Communications

    Authors: Chao Zhou, Changsheng You, Tianyu Liu, Bin Lyu

    Abstract: In this paper, we study the deployment optimization for an extremely large-scale intelligent reflecting surface (XL-IRS) assisted multi-user communication system, within which the channels between the XL-IRS and the BS (or user) are modeled by the near-field spherical wavefronts. To draw some valuable insights, we first consider the single-user case, where an alternating optimization (AO) based al…

    Submitted 28 April, 2025; originally announced April 2025.

  9. arXiv:2504.18425  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.MM cs.SD

    Kimi-Audio Technical Report

    Authors: KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, et al. (15 additional authors not shown)

    Abstract: We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input a…

    Submitted 25 April, 2025; originally announced April 2025.

  10. arXiv:2504.18016  [pdf, ps, other]

    eess.SP

    Optimal Power Allocation for OFDM-based Ranging Using Random Communication Signals

    Authors: Ying Zhang, Fan Liu, Tao Liu, Shi Jin

    Abstract: High-precision ranging plays a crucial role in future 6G Integrated Sensing and Communication (ISAC) systems. To improve the ranging performance while maximizing the resource utilization efficiency, future 6G ISAC networks have to reuse data payload signals for both communication and sensing, whose inherent randomness may deteriorate the ranging performance. To address this issue, this paper inves…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: 12 pages, 9 figures, submitted to IEEE for possible publication

  11. arXiv:2504.15081  [pdf, other]

    eess.SY

    PID-GM: PID Control with Gain Mapping

    Authors: Bo Zhu, Wei Yu, Hugh H. T. Liu

    Abstract: Proportional-Integral-Differential (PID) control is widely used in industrial control systems. However, up to now there are at least two open problems related with PID control. One is to have a comprehensive understanding of its robustness with respect to model uncertainties and disturbances. The other is to build intuitive, explicit and mathematically provable guidelines for PID gain tuning. In t…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 8 pages, 7 figures

  12. arXiv:2504.05657  [pdf, other]

    eess.AS cs.AI cs.SD

    Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

    Authors: Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

    Abstract: Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhe…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: This manuscript has been submitted for peer review

  13. arXiv:2503.14655  [pdf, other]

    q-bio.NC cs.AI cs.CV eess.IV

    Core-Periphery Principle Guided State Space Model for Functional Connectome Classification

    Authors: Minheng Chen, Xiaowei Yu, Jing Zhang, Tong Chen, Chao Cao, Yan Zhuang, Yanjun Lyu, Lu Zhang, Tianming Liu, Dajiang Zhu

    Abstract: Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches…

    Submitted 18 March, 2025; originally announced March 2025.

  14. arXiv:2503.01938  [pdf, other]

    eess.IV cs.CV

    A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal

    Authors: Jun-Jie Huang, Tianrui Liu, Zihan Chen, Xinwang Liu, Meng Wang, Pier Luigi Dragotti

    Abstract: Single Image Reflection Removal (SIRR) is a canonical blind source separation problem and refers to the issue of separating a reflection-contaminated image into a transmission and a reflection image. The core challenge lies in minimizing the commonalities among different sources. Existing deep learning approaches either neglect the significance of feature interactions or rely on heuristically desi…

    Submitted 3 March, 2025; originally announced March 2025.

  15. arXiv:2502.18889  [pdf, other]

    cs.SD cs.AI cs.CL cs.HC cs.LG eess.AS

    Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding

    Authors: Tianyun Liu

    Abstract: Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of…

    Submitted 8 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  16. arXiv:2502.16584  [pdf, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Audio-FLAN: A Preliminary Release

    Authors: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

    Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learnin…

    Submitted 23 February, 2025; originally announced February 2025.

  17. arXiv:2502.13972  [pdf, other]

    eess.SP cs.AI cs.LG

    IncepFormerNet: A multi-scale multi-head attention network for SSVEP classification

    Authors: Yan Huang, Yongru Chen, Lei Cao, Yongnian Cao, Xuechun Yang, Yilin Dong, Tianyu Liu

    Abstract: In recent years, deep learning (DL) models have shown outstanding performance in EEG classification tasks, particularly in Steady-State Visually Evoked Potential (SSVEP)-based Brain-Computer Interface (BCI) systems. DL methods have been successfully applied to SSVEP-BCI. This study proposes a new model called IncepFormerNet, which is a hybrid of the Inception and Transformer architectures. IncepForm…

    Submitted 4 February, 2025; originally announced February 2025.

  18. arXiv:2502.13366  [pdf, other]

    cs.RO eess.SY

    Low-Complexity Cooperative Payload Transportation for Nonholonomic Mobile Robots Under Scalable Constraints

    Authors: Renhe Guan, Yuanzhe Wang, Tao Liu, Yan Wang

    Abstract: Cooperative transportation, a key aspect of logistics cyber-physical systems (CPS), is typically approached using distributed control and optimization-based methods. The distributed control methods consume less time, but poorly handle and extend to multiple constraints. Instead, optimization-based methods handle constraints effectively, but they are usually centralized, time-consuming a…

    Submitted 18 February, 2025; originally announced February 2025.

  19. arXiv:2501.16409  [pdf]

    eess.IV cs.AI q-bio.NC

    Classification of Mild Cognitive Impairment Based on Dynamic Functional Connectivity Using Spatio-Temporal Transformer

    Authors: Jing Zhang, Yanjun Lyu, Xiaowei Yu, Lu Zhang, Chao Cao, Tong Chen, Minheng Chen, Yan Zhuang, Tianming Liu, Dajiang Zhu

    Abstract: Dynamic functional connectivity (dFC) using resting-state functional magnetic resonance imaging (rs-fMRI) is an advanced technique for capturing the dynamic changes of neural activities, and can be very useful in the studies of brain diseases such as Alzheimer's disease (AD). Yet, existing studies have not fully leveraged the sequential information embedded within dFC that can potentially provide…

    Submitted 27 January, 2025; originally announced January 2025.

  20. Brain-Adapter: Enhancing Neurological Disorder Analysis with Adapter-Tuning Multimodal Large Language Models

    Authors: Jing Zhang, Xiaowei Yu, Yanjun Lyu, Lu Zhang, Tong Chen, Chao Cao, Yan Zhuang, Minheng Chen, Tianming Liu, Dajiang Zhu

    Abstract: Understanding brain disorders is crucial for accurate clinical diagnosis and treatment. Recent advances in Multimodal Large Language Models (MLLMs) offer a promising approach to interpreting medical images with the support of text descriptions. However, previous research has primarily focused on 2D medical images, leaving richer spatial information of 3D images under-explored, and single-modality-…

    Submitted 27 January, 2025; originally announced January 2025.

  21. arXiv:2501.14279  [pdf, other]

    eess.IV cs.CV

    Deep Learning-Powered Classification of Thoracic Diseases in Chest X-Rays

    Authors: Yiming Lei, Michael Nguyen, Tzu Chia Liu, Hyounkyun Oh

    Abstract: Chest X-rays play a pivotal role in diagnosing respiratory diseases such as pneumonia, tuberculosis, and COVID-19, which are prevalent and present unique diagnostic challenges due to overlapping visual features and variability in image quality. Severe class imbalance and the complexity of medical images hinder automated analysis. This study leverages deep learning techniques, including transfer le…

    Submitted 24 January, 2025; originally announced January 2025.

  22. arXiv:2501.09935  [pdf, ps, other]

    eess.IV cs.CV physics.med-ph

    Physics-informed DeepCT: Sinogram Wavelet Decomposition Meets Masked Diffusion

    Authors: Zekun Zhou, Tan Liu, Bing Yu, Yanru Gong, Liu Shi, Qiegen Liu

    Abstract: Diffusion model shows remarkable potential on sparse-view computed tomography (SVCT) reconstruction. However, when a network is trained on a limited sample space, its generalization capability may be constrained, which degrades performance on unfamiliar data. For image generation tasks, this can lead to issues such as blurry details and inconsistencies between regions. To alleviate this problem, w…

    Submitted 15 June, 2025; v1 submitted 16 January, 2025; originally announced January 2025.

  23. arXiv:2501.06488  [pdf, other]

    cs.CV cs.AI cs.HC cs.MM eess.IV

    NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References

    Authors: Qiang Qu, Yiran Shen, Xiaoming Chen, Yuk Ying Chung, Weidong Cai, Tongliang Liu

    Abstract: Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly…

    Submitted 11 January, 2025; originally announced January 2025.

  24. arXiv:2501.05729  [pdf, other]

    cs.SD cs.AI eess.AS

    ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification

    Authors: Yi Ma, Shuai Wang, Tianchi Liu, Haizhou Li

    Abstract: In speaker verification, we use computational method to verify if an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable res…

    Submitted 14 January, 2025; v1 submitted 10 January, 2025; originally announced January 2025.

    Comments: Accepted by IEEE Signal Processing Letters

  25. JammingSnake: A follow-the-leader continuum robot with variable stiffness based on fiber jamming

    Authors: Chen Qian, Tangyou Liu, Liao Wu

    Abstract: Follow-the-leader (FTL) motion is essential for continuum robots operating in fragile and confined environments. It allows the robot to exert minimal force on its surroundings, reducing the risk of damage. This paper presents a novel design of a snake-like robot capable of achieving FTL motion by integrating fiber jamming modules (FJMs). The proposed robot can dynamically adjust its stiffness duri…

    Submitted 19 June, 2025; v1 submitted 4 January, 2025; originally announced January 2025.

    Comments: 8 pages, 4 figures, published in T-MECH

  26. arXiv:2412.18619  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG cs.MM eess.AS

    Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

    Authors: Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, Yichi Zhang, Ruoyu Wu, Qingxiu Dong, Ge Zhang, Jian Yang, Lingwei Meng, Shujie Hu, Yulong Chen, Junyang Lin, Shuai Bai, Andreas Vlachos, Xu Tan, Minjia Zhang, Wen Xiao, Aaron Yee, et al. (2 additional authors not shown)

    Abstract: Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks f…

    Submitted 29 December, 2024; v1 submitted 16 December, 2024; originally announced December 2024.

    Comments: 69 pages, 18 figures, repo at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction

  27. arXiv:2412.16474  [pdf, other]

    eess.AS cs.CL

    Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling

    Authors: Shao-Syuan Huang, Kuan-Po Huang, Andy T. Liu, Hung-yi Lee

    Abstract: Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles…

    Submitted 20 December, 2024; originally announced December 2024.

    Comments: Accepted by ICASSP 2025

  28. arXiv:2412.11590  [pdf, other]

    cs.RO eess.SY

    A Real-Time System for Scheduling and Managing UAV Delivery in Urban

    Authors: Han Liu, Tian Liu, Kai Huang

    Abstract: As urban logistics demand continues to grow, UAV delivery has become a key solution to improve delivery efficiency, reduce traffic congestion, and lower logistics costs. However, to fully leverage the potential of UAV delivery networks, efficient swarm scheduling and management are crucial. In this paper, we propose a real-time scheduling and management system based on the "Airport-Unloading Stat…

    Submitted 16 December, 2024; originally announced December 2024.

  29. arXiv:2412.11538  [pdf, other]

    cs.CL cs.AI eess.AS

    MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond

    Authors: Muhammad Huzaifah, Geyu Lin, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Nancy F. Chen, Ai Ti Aw

    Abstract: This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports main…

    Submitted 20 December, 2024; v1 submitted 16 December, 2024; originally announced December 2024.

  30. arXiv:2412.11407  [pdf, other]

    cs.CV eess.IV

    An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds

    Authors: TianZhu Liu, BangYan Hu, YanFeng Gu, Xian Li, Aleksandra Pižurica

    Abstract: Multispectral point cloud (MPC) captures 3D spatial-spectral information from the observed scene, which can be used for scene understanding and has a wide range of applications. However, most of the existing classification methods were extensively tested on indoor datasets, and when applied to outdoor datasets they still face problems including sparse labeled targets, differences in land-covers sc…

    Submitted 15 December, 2024; originally announced December 2024.

    Comments: 16 pages, 9 figures, 5 tables

  31. arXiv:2411.11762  [pdf]

    cs.RO eess.SY

    High-Speed Cornering Control and Real-Vehicle Deployment for Autonomous Electric Vehicles

    Authors: Shiyue Zhao, Junzhi Zhang, Neda Masoud, Yuhong Jiang, Heye Huang, Tao Liu

    Abstract: Executing drift maneuvers during high-speed cornering presents significant challenges for autonomous vehicles, yet offers the potential to minimize turning time and enhance driving dynamics. While reinforcement learning (RL) has shown promising results in simulated environments, discrepancies between simulations and real-world conditions have limited its practical deployment. This study introduces…

    Submitted 21 November, 2024; v1 submitted 18 November, 2024; originally announced November 2024.

    Comments: In the process of being submitted to the Journal of IEEE Transactions on Industrial Electronics

  32. arXiv:2411.07500  [pdf, other]

    eess.IV

    MaDiNet: Mamba Diffusion Network for SAR Target Detection

    Authors: Jie Zhou, Chao Xiao, Bowen Peng, Tianpeng Liu, Zhen Liu, Yongxiang Liu, Li Liu

    Abstract: The fundamental challenge in SAR target detection lies in developing discriminative, efficient, and robust representations of target characteristics within intricate non-cooperative environments. However, accurate target detection is impeded by factors including the sparse distribution and discrete features of the targets, as well as complex background interference. In this study, we propose a \te…

    Submitted 11 November, 2024; originally announced November 2024.

  33. arXiv:2411.05361  [pdf, ps, other]

    cs.CL eess.AS

    Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

    Authors: Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang, et al. (55 additional authors not shown)

    Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati…

    Submitted 9 June, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: ICLR 2025

  34. arXiv:2410.17812  [pdf, other]

    eess.IV cs.AI cs.CV

    PGDiffSeg: Prior-Guided Denoising Diffusion Model with Parameter-Shared Attention for Breast Cancer Segmentation

    Authors: Feiyan Feng, Tianyu Liu, Hong Wang, Jun Zhao, Wei Li, Yanshen Sun

    Abstract: Early detection through imaging and accurate diagnosis is crucial in mitigating the high mortality rate associated with breast cancer. However, locating tumors from low-resolution and high-noise medical images is extremely challenging. Therefore, this paper proposes a novel PGDiffSeg (Prior-Guided Diffusion Denoising Model with Parameter-Shared Attention) that applies diffusion denoising methods t…

    Submitted 23 October, 2024; originally announced October 2024.

  35. arXiv:2410.09674  [pdf, other]

    eess.IV cs.CV cs.LG cs.NE

    EG-SpikeFormer: Eye-Gaze Guided Transformer on Spiking Neural Networks for Medical Image Analysis

    Authors: Yi Pan, Hanqi Jiang, Junhao Chen, Yiwei Li, Huaqin Zhao, Yifan Zhou, Peng Shu, Zihao Wu, Zhengliang Liu, Dajiang Zhu, Xiang Li, Yohannes Abate, Tianming Liu

    Abstract: Neuromorphic computing has emerged as a promising energy-efficient alternative to traditional artificial intelligence, predominantly utilizing spiking neural networks (SNNs) implemented on neuromorphic hardware. Significant advancements have been made in SNN-based convolutional neural networks (CNNs) and Transformer architectures. However, neuromorphic computing for the medical imaging domain rema…

    Submitted 29 October, 2024; v1 submitted 12 October, 2024; originally announced October 2024.

  36. arXiv:2410.07594  [pdf]

    eess.SY

    Design and Characterization of High Efficiency Single-stage Electromagnetic Coil Guns

    Authors: Sophia Chen, Annie Peng, Ava Chen, Takyiu Liu

    Abstract: This study presents several novel approaches to improve the efficiency of a single-stage coil gun. Conventional designs typically feature a uniformly wound solenoid and a ferrite projectile. For our research, we constructed a microcontroller-based prototype to test several new enhancements, including the use of a bipolar current pulse, a stepped multilayer coil with non-uniform winding densities,…

    Submitted 10 October, 2024; originally announced October 2024.

    Comments: 10 pages, 23 figures

  37. arXiv:2410.04081  [pdf, other]

    cs.CV cs.AI eess.IV

    Epsilon-VAE: Denoising as Visual Decoding

    Authors: Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu

    Abstract: In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representatio…

    Submitted 28 May, 2025; v1 submitted 5 October, 2024; originally announced October 2024.

    Comments: Accepted to ICML 2025. v2: added comparisons to SD-VAE and more visual results; v3: minor change to title; v4: camera-ready version

  38. arXiv:2410.03143  [pdf, other]

    eess.IV cs.CV cs.LG

    ECHOPulse: ECG controlled echocardio-grams video generation

    Authors: Yiwei Li, Sekeun Kim, Zihao Wu, Hanqi Jiang, Yi Pan, Pengfei Jin, Sifan Song, Yucheng Shi, Tianming Liu, Quanzheng Li, Xiang Li

    Abstract: Echocardiography (ECHO) is essential for cardiac assessments, but its video quality and interpretation heavily relies on manual expertise, leading to inconsistent results from clinical and portable devices. ECHO video generation offers a solution by improving automated monitoring through synthetic data and generating high-quality videos from routine health data. However, existing models often face…

    Submitted 11 October, 2024; v1 submitted 4 October, 2024; originally announced October 2024.

  39. Mind the Gap: Promoting Missing Modality Brain Tumor Segmentation with Alignment

    Authors: Tianyi Liu, Zhaorui Tan, Haochuan Jiang, Xi Yang, Kaizhu Huang

    Abstract: Brain tumor segmentation is often based on multiple magnetic resonance imaging (MRI). However, in clinical practice, certain modalities of MRI may be missing, which presents an even more difficult scenario. To cope with this challenge, knowledge distillation has emerged as one promising strategy. However, recent efforts typically overlook the modality gaps and thus fail to learn invariant feature…

    Submitted 28 September, 2024; originally announced September 2024.

  40. arXiv:2409.16295  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget

    Authors: Andy T. Liu, Yi-Cheng Lin, Haibin Wu, Stefan Winkler, Hung-yi Lee

    Abstract: Despite their impressive success, training foundation models remains computationally costly. This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget. We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size. Our goal is to make analytical steps toward under…

    Submitted 4 February, 2025; v1 submitted 9 September, 2024; originally announced September 2024.

    Comments: Accepted to IEEE SLT 2024

    Journal ref: 2024 IEEE Spoken Language Technology Workshop (SLT)

  41. arXiv:2409.13104  [pdf, other]

    cs.CV cs.AI cs.LG eess.SY

    ERIC: Estimating Rainfall with Commodity Doorbell Camera for Precision Residential Irrigation

    Authors: Tian Liu, Liuyi Jin, Radu Stoleru, Amran Haroon, Charles Swanson, Kexin Feng

    Abstract: Current state-of-the-art residential irrigation systems, such as WaterMyYard, rely on rainfall data from nearby weather stations to adjust irrigation amounts. However, the accuracy of rainfall data is compromised by the limited spatial resolution of rain gauges and the significant variability of hyperlocal rainfall, leading to substantial water waste. To improve irrigation efficiency, we developed…

    Submitted 3 October, 2024; v1 submitted 19 September, 2024; originally announced September 2024.

    Comments: BuildSys 2024

  42. arXiv:2409.09601  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    A Survey of Foundation Models for Music Understanding

    Authors: Wenjun Li, Ying Cai, Ziyang Wu, Wenyi Zhang, Yifan Chen, Rundong Qi, Mengqi Dong, Peigen Chen, Xiao Dong, Fenghao Shi, Lei Guo, Junwei Han, Bao Ge, Tianming Liu, Lin Gan, Tuo Zhang

    Abstract: Music is essential in daily life, fulfilling emotional and entertainment needs, and connecting us personally, socially, and culturally. A better understanding of music can enhance our emotions, cognitive skills, and cultural connections. The rapid advancement of artificial intelligence (AI) has introduced new ways to analyze music, aiming to replicate human understanding of music and provide relat…

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: 20 pages, 2 figures

  43. arXiv:2409.08346  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing

    Authors: Tianchi Liu, Ivan Kukanov, Zihan Pan, Qiongqiong Wang, Hardik B. Sailor, Kong Aik Lee

    Abstract: Language mismatch affects speech anti-spoofing systems, yet investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on Eng…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024

  44. arXiv:2409.02302  [pdf, other]

    eess.AS cs.AI cs.SD

    Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

    Authors: Anmol Guragain, Tianchi Liu, Zihan Pan, Hardik B. Sailor, Qiongqiong Wang

    Abstract: This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVD…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024

  45. arXiv:2408.15887  [pdf]

    eess.IV cs.CV

    SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors

    Authors: Zhiqing Zhang, Tianyong Liu, Guojia Fan, Bin Li, Qianjin Feng, Shoujun Zhou

    Abstract: Accurate segmentation of 3D clinical medical images is critical in the diagnosis and treatment of spinal diseases. However, the inherent complexity of spinal anatomy and the uncertainty inherent in current imaging technologies pose significant challenges for semantic segmentation of spinal images. Although convolutional neural networks (CNNs) and Transformer-based models have made some progress in s…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: 17 pages, 11 figures

  46. arXiv:2407.11481  [pdf, other]

    cs.LG cs.AI eess.SP

    Multi-Channel Masked Autoencoder and Comprehensive Evaluations for Reconstructing 12-Lead ECG from Arbitrary Single-Lead ECG

    Authors: Jiarong Chen, Wanqing Wu, Tong Liu, Shenda Hong

    Abstract: Electrocardiogram (ECG) has emerged as a widely accepted diagnostic instrument for cardiovascular diseases (CVD). The standard clinical 12-lead ECG configuration causes considerable inconvenience and discomfort, while wearable devices offer a more practical alternative. To reduce the information gap between 12-lead ECG and single-lead ECG, this study proposes a multi-channel masked autoencoder (MCMA)…

    Submitted 3 October, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

    Comments: It is a revised version. The open-source code is publicly available at https://github.com/CHENJIAR3/MCMA

  47. arXiv:2407.05758  [pdf, other]

    eess.IV cs.AI cs.CV

    Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

    Authors: Yutong Zhang, Yi Pan, Tianyang Zhong, Peixin Dong, Kangni Xie, Yuxiao Liu, Hanqi Jiang, Zhengliang Liu, Shijie Zhao, Tuo Zhang, Xi Jiang, Dinggang Shen, Tianming Liu, Xin Zhang

    Abstract: Medical images and radiology reports are crucial for diagnosing medical conditions, highlighting the importance of quantitative analysis for clinical decision-making. However, the diversity and cross-source heterogeneity of these data challenge the generalizability of current data-mining methods. Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecti…

    Submitted 8 July, 2024; originally announced July 2024.

  48. arXiv:2406.16041  [pdf, ps, other]

    eess.SP

    Gridless Parameter Estimation in Partly Calibrated Rectangular Arrays

    Authors: Tianyi Liu, Sai Pavan Deram, Khaled Ardah, Martin Haardt, Marc E. Pfetsch, Marius Pesavento

    Abstract: Spatial frequency estimation from a mixture of noisy sinusoids finds applications in various fields. While subspace-based methods offer cost-effective super-resolution parameter estimation, they demand precise array calibration, posing challenges for large antennas. In contrast, sparsity-based approaches outperform subspace methods, especially in scenarios with limited snapshots or correlated sour…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 16 pages, 5 figures. This work has been submitted to the IEEE Transactions on Signal Processing for possible publication

  49. arXiv:2406.14799  [pdf, other]

    cs.RO eess.SY

    Capture Point Control in Thruster-Assisted Bipedal Locomotion

    Authors: Shreyansh Pitroda, Aditya Bondada, Kaushik Venkatesh Krishnamurthy, Adarsh Salagame, Chenghao Wang, Taoran Liu, Bibek Gupta, Eric Sihite, Reza Nemovi, Alireza Ramezani, Morteza Gharib

    Abstract: Despite major advancements in control design that are robust to unplanned disturbances, bipedal robots are still susceptible to falling over and struggle to negotiate rough terrains. By utilizing thrusters in our bipedal robot, we can perform additional posture manipulation and expand the modes of locomotion to enhance the robot's stability and ability to negotiate rough and difficult-to-navigate…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Submitted and to be presented at IEEE AIM 2024. arXiv admin note: substantial text overlap with arXiv:2103.15952

  50. A tensor model for the calibration of air-coupled ultrasonic sensor arrays in 3D imaging

    Authors: Raphael Müller, Gianni Allevato, Matthias Rutsch, Christoph Haugwitz, Tianyi Liu, Mario Kupnik, Marius Pesavento

    Abstract: Arrays of ultrasonic sensors are capable of 3D imaging in air and an affordable supplement to other sensing modalities, such as radar, lidar, and camera, i.e. in heterogeneous sensing systems. However, manufacturing tolerances of air-coupled ultrasonic sensors may lead to amplitude and phase deviations. Together with artifacts from imperfect knowledge of the array geometry, there are numerous fact…

    Submitted 18 December, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 22 pages, 6 figures. This work has been accepted for publication by Elsevier B.V.

    Journal ref: Signal Process. 230 (2025) 109812