-
Layer Decomposition and Morphological Reconstruction for Task-Oriented Infrared Image Enhancement
Authors:
Siyuan Chai,
Xiaodong Guo,
Tong Liu
Abstract:
Infrared image helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared image often suffers from low contrast, especially in non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying…
▽ More
Infrared image helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared image often suffers from low contrast, especially in non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying noise and losing important information remains a challenge. To address these challenges, we propose a task-oriented infrared image enhancement method. Our approach consists of two key components: layer decomposition and saliency information extraction. First, we design an layer decomposition method for infrared images, which enhances scene details while preserving dark region features, providing more features for subsequent saliency information extraction. Then, we propose a morphological reconstruction-based saliency extraction method that effectively extracts and enhances target information without amplifying noise. Our method improves the image quality for object detection and semantic segmentation tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods.
△ Less
Submitted 29 June, 2025;
originally announced June 2025.
-
MTSIC: Multi-stage Transformer-based GAN for Spectral Infrared Image Colorization
Authors:
Tingting Liu,
Yuan Liu,
Jinhui Tang,
Liyin Yuan,
Chengyu Liu,
Chunlai Li,
Xiubao Sui,
Qian Chen
Abstract:
Thermal infrared (TIR) images, acquired through thermal radiation imaging, are unaffected by variations in lighting conditions and atmospheric haze. However, TIR images inherently lack color and texture information, limiting downstream tasks and potentially causing visual fatigue. Existing colorization methods primarily rely on single-band images with limited spectral information and insufficient…
▽ More
Thermal infrared (TIR) images, acquired through thermal radiation imaging, are unaffected by variations in lighting conditions and atmospheric haze. However, TIR images inherently lack color and texture information, limiting downstream tasks and potentially causing visual fatigue. Existing colorization methods primarily rely on single-band images with limited spectral information and insufficient feature extraction capabilities, which often result in image distortion and semantic ambiguity. In contrast, multiband infrared imagery provides richer spectral data, facilitating the preservation of finer details and enhancing semantic accuracy. In this paper, we propose a generative adversarial network (GAN)-based framework designed to integrate spectral information to enhance the colorization of infrared images. The framework employs a multi-stage spectral self-attention Transformer network (MTSIC) as the generator. Each spectral feature is treated as a token for self-attention computation, and a multi-head self-attention mechanism forms a spatial-spectral attention residual block (SARB), achieving multi-band feature mapping and reducing semantic confusion. Multiple SARB units are integrated into a Transformer-based single-stage network (STformer), which uses a U-shaped architecture to extract contextual information, combined with multi-scale wavelet blocks (MSWB) to align semantic information in the spatial-frequency dual domain. Multiple STformer modules are cascaded to form MTSIC, progressively optimizing the reconstruction quality. Experimental results demonstrate that the proposed method significantly outperforms traditional techniques and effectively enhances the visual quality of infrared images.
△ Less
Submitted 20 June, 2025;
originally announced June 2025.
-
Ming-Omni: A Unified Multimodal Model for Perception and Generation
Authors:
Inclusion AI,
Biao Gong,
Cheng Zou,
Chuanyang Zheng,
Chunluan Zhou,
Canxiang Yan,
Chunxiang Jin,
Chunjie Shen,
Dandan Zheng,
Fudong Wang,
Furong Xu,
GuangMing Yao,
Jun Zhou,
Jingdong Chen,
Jianxin Sun,
Jiajia Liu,
Jianjiang Zhu,
Jun Peng,
Kaixiang Ji,
Kaiyou Song,
Kaimeng Ren,
Libin Wang,
Lixiang Ru,
Lele Xie,
Longhua Tan
, et al. (33 additional authors not shown)
Abstract:
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…
▽ More
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results showcase Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation
Authors:
Qiongqiong Wang,
Hardik B. Sailor,
Tianchi Liu,
Ai Ti Aw
Abstract:
Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data, that integrates contextual reasoning with paralinguistic information. It consists of a pseudo paralinguistic label-based da…
▽ More
Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data, that integrates contextual reasoning with paralinguistic information. It consists of a pseudo paralinguistic label-based data condensation of in-the-wild speech and LLM-based Contextual Paralinguistic QA (CPQA) generation. The effectiveness is validated by a strong correlation in evaluations of the Qwen2-Audio-7B-Instruct model on a dataset created by our framework and human-generated CPQA dataset. The results also reveal the speech-LLM's limitations in handling empathetic reasoning tasks, highlighting the need for such datasets and more robust models. The proposed framework is first of its kind and has potential in training more robust speech-LLMs with paralinguistic reasoning capabilities.
△ Less
Submitted 3 June, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
SAR-GTR: Attributed Scattering Information Guided SAR Graph Transformer Recognition Algorithm
Authors:
Xuying Xiong,
Xinyu Zhang,
Weidong Jiang,
Li Liu,
Yongxiang Liu,
Tianpeng Liu
Abstract:
Utilizing electromagnetic scattering information for SAR data interpretation is currently a prominent research focus in the SAR interpretation domain. Graph Neural Networks (GNNs) can effectively integrate domain-specific physical knowledge and human prior knowledge, thereby alleviating challenges such as limited sample availability and poor generalization in SAR interpretation. In this study, we…
▽ More
Utilizing electromagnetic scattering information for SAR data interpretation is currently a prominent research focus in the SAR interpretation domain. Graph Neural Networks (GNNs) can effectively integrate domain-specific physical knowledge and human prior knowledge, thereby alleviating challenges such as limited sample availability and poor generalization in SAR interpretation. In this study, we thoroughly investigate the electromagnetic inverse scattering information of single-channel SAR and re-examine the limitations of applying GNNs to SAR interpretation. We propose the SAR Graph Transformer Recognition Algorithm (SAR-GTR). SAR-GTR carefully considers the attributes and characteristics of different electromagnetic scattering parameters by distinguishing the mapping methods for discrete and continuous parameters, thereby avoiding information confusion and loss. Furthermore, the GTR combines GNNs with the Transformer mechanism and introduces an edge information enhancement channel to facilitate interactive learning of node and edge features, enabling the capture of robust and global structural characteristics of targets. Additionally, the GTR constructs a hierarchical topology-aware system through global node encoding and edge position encoding, fully exploiting the hierarchical structural information of targets. Finally, the effectiveness of the algorithm is validated using the ATRNet-STAR large-scale vehicle dataset.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Design and Experimental Test of Datatic Approximate Optimal Filter in Nonlinear Dynamic Systems
Authors:
Weixian He,
Zeyu He,
Wenhan Cao,
Haoyu Gao,
Tong Liu,
Bin Shuai,
Chang Liu,
Shengbo Eben Li
Abstract:
Filtering is crucial in engineering fields, providing vital state estimation for control systems. However, the nonlinear nature of complex systems and the presence of non-Gaussian noises pose significant challenges to the performance of conventional filtering methods in terms of estimation accuracy and computational efficiency. In this work, we present a data-driven closed-loop filter, termed data…
▽ More
Filtering is crucial in engineering fields, providing vital state estimation for control systems. However, the nonlinear nature of complex systems and the presence of non-Gaussian noises pose significant challenges to the performance of conventional filtering methods in terms of estimation accuracy and computational efficiency. In this work, we present a data-driven closed-loop filter, termed datatic approximate optimal filter (DAOF), specifically designed for nonlinear systems under non-Gaussian conditions. We first formulate a Markovian filtering problem (MFP), which inherently shares a connection with reinforcement learning (RL) as it aims to compute the optimal state estimate by minimizing the accumulated error. To solve MFP, we propose DAOF, which primarily incorporates a trained RL policy and features two distinct structural designs: DAOF-v1 and DAOF-v2. Designed for systems with explicit models, DAOF-v1 combines prediction and update phases, with the RL policy generating the update value. Meanwhile, DAOF-v2 bypasses system modeling by directly outputting the state estimate. Then, we utilize an actor-critic algorithm to learn the parameterized policy for DAOF. Experimental results on a 2-degree-of-freedom (2-DOF) vehicle system, equipped with explicit system models, demonstrate the superior accuracy and computational efficiency of DAOF-v1 compared to existing nonlinear filters. Moreover, DAOF-v2 showcases its unique ability to perform filtering without requiring explicit system modeling, as validated by a 14-DOF vehicle system.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Physical-Layer Security in Mixed Near-Field and Far-Field Communication Systems
Authors:
Tianyu Liu,
Changsheng You,
Cong Zhou,
Yunpu Zhang,
Shiqi Gong,
Heng Liu,
Guangchi Zhang
Abstract:
Extremely large-scale arrays (XL-arrays) have emerged as a promising technology to improve the spectrum efficiency and spatial resolution of future wireless systems. Different from existing works that mostly considered physical layer security (PLS) in either the far-field or near-field, we consider in this paper a new and practical scenario, where legitimate users (Bobs) are located in the far-fie…
▽ More
Extremely large-scale arrays (XL-arrays) have emerged as a promising technology to improve the spectrum efficiency and spatial resolution of future wireless systems. Different from existing works that mostly considered physical layer security (PLS) in either the far-field or near-field, we consider in this paper a new and practical scenario, where legitimate users (Bobs) are located in the far-field of a base station (BS) while eavesdroppers (Eves) are located in the near-field for intercepting confidential information at short distance, referred to as the mixed near-field and far-field PLS. Specifically, we formulate an optimization problem to maximize the sum-secrecy-rate of all Bobs by optimizing the power allocation of the BS, subject to the constraint on the total BS transmit power. To shed useful insights, we first consider a one-Bob-one-Eve system and characterize the insecure-transmission region of the Bob in closed form. Interestingly, we show that the insecure-transmission region is significantly \emph{expanded} as compared to that in conventional far-field PLS systems, due to the energy-spread effect in the mixed-field scenario. Then, we further extend the analysis to a two-Bob-one-Eve system. It is revealed that as compared to the one-Bob system, the interferences from the other Bob can be effectively used to weaken the capability of Eve for intercepting signals of target Bobs, thus leading to enhanced secrecy rates. Furthermore, we propose an efficient algorithm to obtain a high-quality solution to the formulated non-convex problem by leveraging the successive convex approximation (SCA) technique. Finally, numerical results demonstrate that our proposed algorithm achieves a higher sum-secrecy-rate than the benchmark scheme where the power allocation is designed based on the (simplified) far-field channel model.
△ Less
Submitted 4 May, 2025; v1 submitted 28 April, 2025;
originally announced April 2025.
-
Deployment Optimization for XL-IRS Assisted Multi-User Communications
Authors:
Chao Zhou,
Changsheng You,
Tianyu Liu,
Bin Lyu
Abstract:
In this paper, we study the deployment optimization for an extremely large-scale intelligent reflecting surface (XL-IRS) assisted multi-user communication system, within which the channels between the XL-IRS and the BS (or user) are modeled by the near-field spherical wavefronts. To draw some valuable insights, we first consider the single-user case, where an alternating optimization (AO) based al…
▽ More
In this paper, we study the deployment optimization for an extremely large-scale intelligent reflecting surface (XL-IRS) assisted multi-user communication system, within which the channels between the XL-IRS and the BS (or user) are modeled by the near-field spherical wavefronts. To draw some valuable insights, we first consider the single-user case, where an alternating optimization (AO) based algorithm is devised to maximize the received signal-to-noise ratio (SNR) at the user. To address the high computational complexity issue incurred by the AO based algorithm, three approximate received SNR expressions are obtained to yield useful insights, corresponding to the upper bound, approximate expression, and closed-form. It is demonstrated that the XL-IRS ought to be positioned near the user (rather than the BS) to obtain a higher beamforming gain. Then, for the multi-user scenario, an efficient algorithm is proposed to obtain a high-quality XL-IRS placement solution by using the AO and successive convex approximation (SCA) techniques. Furthermore, the effective degree of freedom (DoF) of the BS-IRS channel is provided, which indicates that the additional effective DoF can be leveraged to improve multi-user spatial multiplexing. Last, numerical results confirm the existence of a trade-off between near-field beam-focusing gain and multiplexing gain.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
Kimi-Audio Technical Report
Authors:
KimiTeam,
Ding Ding,
Zeqian Ju,
Yichong Leng,
Songxiang Liu,
Tong Liu,
Zeyu Shang,
Kai Shen,
Wei Song,
Xu Tan,
Heyi Tang,
Zhengtao Wang,
Chu Wei,
Yifei Xin,
Xinran Xu,
Jianwei Yu,
Yutao Zhang,
Xinyu Zhou,
Y. Charles,
Jun Chen,
Yanru Chen,
Yulun Du,
Weiran He,
Zhenxing Hu,
Guokun Lai
, et al. (15 additional authors not shown)
Abstract:
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input a…
▽ More
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continual pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the codes, model checkpoints, as well as the evaluation toolkits in https://github.com/MoonshotAI/Kimi-Audio.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
Optimal Power Allocation for OFDM-based Ranging Using Random Communication Signals
Authors:
Ying Zhang,
Fan Liu,
Tao Liu,
Shi Jin
Abstract:
High-precision ranging plays a crucial role in future 6G Integrated Sensing and Communication (ISAC) systems. To improve the ranging performance while maximizing the resource utilization efficiency, future 6G ISAC networks have to reuse data payload signals for both communication and sensing, whose inherent randomness may deteriorate the ranging performance. To address this issue, this paper inves…
▽ More
High-precision ranging plays a crucial role in future 6G Integrated Sensing and Communication (ISAC) systems. To improve the ranging performance while maximizing the resource utilization efficiency, future 6G ISAC networks have to reuse data payload signals for both communication and sensing, whose inherent randomness may deteriorate the ranging performance. To address this issue, this paper investigates the power allocation (PA) design for an OFDM-based ISAC system under random signaling, aiming to reduce the ranging sidelobe level of both periodic and aperiodic auto-correlation functions (P-ACF and A-ACF) of the ISAC signal. Towards that end, we first derive the closed-form expressions of the average squared P-ACF and A-ACF, and then propose to minimize the expectation of the integrated sidelobe level (EISL) under arbitrary constellation mapping. We then rigorously prove that the uniform PA scheme achieves the global minimum of the EISL for both P-ACF and A-ACF. As a step further, we show that this scheme also minimizes the P-ACF sidelobe level at every lag. Moreover, we extend our analysis to the P-ACF case with frequency-domain zero-padding, which is a typical approach to improve the ranging resolution. We reveal that there exists a tradeoff between sidelobe level and mainlobe width, and propose a project gradient descent algorithm to seek a locally optimal PA scheme that reduces the EISL. Finally, we validate our theoretical findings through extensive simulation results, confirming the effectiveness of the proposed PA methods in reducing the ranging sidelobe level for random OFDM signals.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
PID-GM: PID Control with Gain Mapping
Authors:
Bo Zhu,
Wei Yu,
Hugh H. T. Liu
Abstract:
Proportional-Integral-Differential (PID) control is widely used in industrial control systems. However, up to now there are at least two open problems related with PID control. One is to have a comprehensive understanding of its robustness with respect to model uncertainties and disturbances. The other is to build intuitive, explicit and mathematically provable guidelines for PID gain tuning. In t…
▽ More
Proportional-Integral-Differential (PID) control is widely used in industrial control systems. However, up to now there are at least two open problems related with PID control. One is to have a comprehensive understanding of its robustness with respect to model uncertainties and disturbances. The other is to build intuitive, explicit and mathematically provable guidelines for PID gain tuning. In this paper, we introduce a simple nonlinear mapping to determine PID gains from three auxiliary parameters. By the mapping, PID control is shown to be equivalent to a new PD control (serving as a nominal control) plus an uncertainty and disturbance compensator (to recover the nominal performance). Then PID control can be understood, designed and tuned in a Two-Degree-of-Freedom (2-DoF) control framework. We discuss some basic properties of the mapping, including the existence, uniqueness and invertibility. Taking as an example the PID control applied to a general uncertain second-order plant, we prove by the singular perturbation theory that the closed-loop steady-state and transient performance depends explicitly on one auxiliary parameter which can be viewed as the virtual singular perturbation parameter (SPP) of PID control. All the three PID gains are monotonically decreasing functions of the SPP, indicating that the smaller the SPP is, the higher the PID gains are, and the better the robustness of PID control is. Simulation and experimental examples are provided to demonstrate the properties of the mapping as well as the effectiveness of the mapping based PID gain turning.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing
Authors:
Tianchi Liu,
Duc-Tuan Truong,
Rohan Kumar Das,
Kong Aik Lee,
Haizhou Li
Abstract:
Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhe…
▽ More
Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
Core-Periphery Principle Guided State Space Model for Functional Connectome Classification
Authors:
Minheng Chen,
Xiaowei Yu,
Jing Zhang,
Tong Chen,
Chao Cao,
Yan Zhuang,
Yanjun Lyu,
Lu Zhang,
Tianming Liu,
Dajiang Zhu
Abstract:
Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches…
▽ More
Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges due to their quadratic complexity in long-sequence modeling. To address these limitations, we propose a Core-Periphery State-Space Model (CP-SSM), an innovative framework for functional connectome classification. Specifically, we introduce Mamba, a selective state-space model with linear complexity, to effectively capture long-range dependencies in functional brain networks. Furthermore, inspired by the core-periphery (CP) organization, a fundamental characteristic of brain networks that enhances efficient information transmission, we design CP-MoE, a CP-guided Mixture-of-Experts that improves the representation learning of brain connectivity patterns. We evaluate CP-SSM on two benchmark fMRI datasets: ABIDE and ADNI. Experimental results demonstrate that CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. These findings highlight the effectiveness and efficiency of CP-SSM in modeling brain functional connectivity, offering a promising direction for neuroimaging-based neurological disease diagnosis.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal
Authors:
Jun-Jie Huang,
Tianrui Liu,
Zihan Chen,
Xinwang Liu,
Meng Wang,
Pier Luigi Dragotti
Abstract:
Single Image Reflection Removal (SIRR) is a canonical blind source separation problem and refers to the issue of separating a reflection-contaminated image into a transmission and a reflection image. The core challenge lies in minimizing the commonalities among different sources. Existing deep learning approaches either neglect the significance of feature interactions or rely on heuristically desi…
▽ More
Single Image Reflection Removal (SIRR) is a canonical blind source separation problem and refers to the issue of separating a reflection-contaminated image into a transmission and a reflection image. The core challenge lies in minimizing the commonalities among different sources. Existing deep learning approaches either neglect the significance of feature interactions or rely on heuristically designed architectures. In this paper, we propose a novel Deep Exclusion unfolding Network (DExNet), a lightweight, interpretable, and effective network architecture for SIRR. DExNet is principally constructed by unfolding and parameterizing a simple iterative Sparse and Auxiliary Feature Update (i-SAFU) algorithm, which is specifically designed to solve a new model-based SIRR optimization formulation incorporating a general exclusion prior. This general exclusion prior enables the unfolded SAFU module to inherently identify and penalize commonalities between the transmission and reflection features, ensuring more accurate separation. The principled design of DExNet not only enhances its interpretability but also significantly improves its performance. Comprehensive experiments on four benchmark datasets demonstrate that DExNet achieves state-of-the-art visual and quantitative results while utilizing only approximately 8\% of the parameters required by leading methods.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding
Authors:
Tianyun Liu
Abstract:
Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of…
▽ More
Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context, thereby ensuring the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets.Audio samples are available at: https://ltydd1314.github.io/.
△ Less
Submitted 8 March, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
Audio-FLAN: A Preliminary Release
Authors:
Liumeng Xue,
Ziya Zhou,
Jiahao Pan,
Zixuan Li,
Shuai Fan,
Yinghao Ma,
Sitong Cheng,
Dongchao Yang,
Haohan Guo,
Yujia Xiao,
Xinsheng Wang,
Zixuan Shen,
Chuanbo Zhu,
Xinshen Zhang,
Tianchi Liu,
Ruibin Yuan,
Zeyue Tian,
Haohe Liu,
Emmanouil Benetos,
Ge Zhang,
Yike Guo,
Wei Xue
Abstract:
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learnin…
▽ More
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
△ Less
Submitted 23 February, 2025;
originally announced February 2025.
-
IncepFormerNet: A multi-scale multi-head attention network for SSVEP classification
Authors:
Yan Huang,
Yongru Chen,
Lei Cao,
Yongnian Cao,
Xuechun Yang,
Yilin Dong,
Tianyu Liu
Abstract:
In recent years, deep learning (DL) models have shown outstanding performance in EEG classification tasks, particularly in Steady-State Visually Evoked Potential(SSVEP)-based Brain-Computer-Interfaces(BCI)systems. DL methods have been successfully applied to SSVEP-BCI. This study proposes a new model called IncepFormerNet, which is a hybrid of the Inception and Transformer architectures. IncepForm…
▽ More
In recent years, deep learning (DL) models have shown outstanding performance in EEG classification tasks, particularly in Steady-State Visually Evoked Potential(SSVEP)-based Brain-Computer-Interfaces(BCI)systems. DL methods have been successfully applied to SSVEP-BCI. This study proposes a new model called IncepFormerNet, which is a hybrid of the Inception and Transformer architectures. IncepFormerNet adeptly extracts multi-scale temporal information from time series data using parallel convolution kernels of varying sizes, accurately capturing the subtle variations and critical features within SSVEP signals.Furthermore, the model integrates the multi-head attention mechanism from the Transformer architecture, which not only provides insights into global dependencies but also significantly enhances the understanding and representation of complex patterns.Additionally, it takes advantage of filter bank techniques to extract features based on the spectral characteristics of SSVEP data. To validate the effectiveness of the proposed model, we conducted experiments on two public datasets, . The experimental results show that IncepFormerNet achieves an accuracy of 87.41 on Dataset 1 and 71.97 on Dataset 2 using a 1.0-second time window. To further verify the superiority of the proposed model, we compared it with other deep learning models, and the results indicate that our method achieves significantly higher accuracy than the others.The source codes in this work are available at: https://github.com/CECNL/SSVEP-DAN.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Low-Complexity Cooperative Payload Transportation for Nonholonomic Mobile Robots Under Scalable Constraints
Authors:
Renhe Guan,
Yuanzhe Wang,
Tao Liu,
Yan Wang
Abstract:
Cooperative transportation, a key aspect of logistics
cyber-physical systems (CPS), is typically approached using dis tributed control and optimization-based methods. The distributed
control methods consume less time, but poorly handle and extend
to multiple constraints. Instead, optimization-based methods
handle constraints effectively, but they are usually centralized,
time-consuming a…
▽ More
Cooperative transportation, a key aspect of logistics
cyber-physical systems (CPS), is typically approached using dis tributed control and optimization-based methods. The distributed
control methods consume less time, but poorly handle and extend
to multiple constraints. Instead, optimization-based methods
handle constraints effectively, but they are usually centralized,
time-consuming and thus not easily scalable to numerous robots.
To overcome drawbacks of both, we propose a novel cooperative
transportation method for nonholonomic mobile robots by im proving conventional formation control, which is distributed, has
a low time-complexity and accommodates scalable constraints.
The proposed control-based method is testified on a cable suspended payload and divided into two parts, including robot
trajectory generation and trajectory tracking. Unlike most time consuming trajectory generation methods, ours can generate
trajectories with only constant time-complexity, needless of global
maps. As for trajectory tracking, our control-based method not
only scales easily to multiple constraints as those optimization based methods, but reduces their time-complexity from poly nomial to linear. Simulations and experiments can verify the
feasibility of our method.
△ Less
Submitted 18 February, 2025;
originally announced February 2025.
-
Classification of Mild Cognitive Impairment Based on Dynamic Functional Connectivity Using Spatio-Temporal Transformer
Authors:
Jing Zhang,
Yanjun Lyu,
Xiaowei Yu,
Lu Zhang,
Chao Cao,
Tong Chen,
Minheng Chen,
Yan Zhuang,
Tianming Liu,
Dajiang Zhu
Abstract:
Dynamic functional connectivity (dFC) using resting-state functional magnetic resonance imaging (rs-fMRI) is an advanced technique for capturing the dynamic changes of neural activities, and can be very useful in the studies of brain diseases such as Alzheimer's disease (AD). Yet, existing studies have not fully leveraged the sequential information embedded within dFC that can potentially provide…
▽ More
Dynamic functional connectivity (dFC) using resting-state functional magnetic resonance imaging (rs-fMRI) is an advanced technique for capturing the dynamic changes of neural activities, and can be very useful in the studies of brain diseases such as Alzheimer's disease (AD). Yet, existing studies have not fully leveraged the sequential information embedded within dFC that can potentially provide valuable information when identifying brain conditions. In this paper, we propose a novel framework that jointly learns the embedding of both spatial and temporal information within dFC based on the transformer architecture. Specifically, we first construct dFC networks from rs-fMRI data through a sliding window strategy. Then, we simultaneously employ a temporal block and a spatial block to capture higher-order representations of dynamic spatio-temporal dependencies, via mapping them into an efficient fused feature representation. To further enhance the robustness of these feature representations by reducing the dependency on labeled data, we also introduce a contrastive learning strategy to manipulate different brain states. Experimental results on 345 subjects with 570 scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate the superiority of our proposed method for MCI (Mild Cognitive Impairment, the prodromal stage of AD) prediction, highlighting its potential for early identification of AD.
△ Less
Submitted 27 January, 2025;
originally announced January 2025.
-
Brain-Adapter: Enhancing Neurological Disorder Analysis with Adapter-Tuning Multimodal Large Language Models
Authors:
Jing Zhang,
Xiaowei Yu,
Yanjun Lyu,
Lu Zhang,
Tong Chen,
Chao Cao,
Yan Zhuang,
Minheng Chen,
Tianming Liu,
Dajiang Zhu
Abstract:
Understanding brain disorders is crucial for accurate clinical diagnosis and treatment. Recent advances in Multimodal Large Language Models (MLLMs) offer a promising approach to interpreting medical images with the support of text descriptions. However, previous research has primarily focused on 2D medical images, leaving richer spatial information of 3D images under-explored, and single-modality-…
▽ More
Understanding brain disorders is crucial for accurate clinical diagnosis and treatment. Recent advances in Multimodal Large Language Models (MLLMs) offer a promising approach to interpreting medical images with the support of text descriptions. However, previous research has primarily focused on 2D medical images, leaving richer spatial information of 3D images under-explored, and single-modality-based methods are limited by overlooking the critical clinical information contained in other modalities. To address this issue, this paper proposes Brain-Adapter, a novel approach that incorporates an extra bottleneck layer to learn new knowledge and instill it into the original pre-trained knowledge. The major idea is to incorporate a lightweight bottleneck layer to train fewer parameters while capturing essential information and utilize a Contrastive Language-Image Pre-training (CLIP) strategy to align multimodal data within a unified representation space. Extensive experiments demonstrated the effectiveness of our approach in integrating multimodal data to significantly improve the diagnosis accuracy without high computational costs, highlighting the potential to enhance real-world diagnostic workflows.
△ Less
Submitted 27 January, 2025;
originally announced January 2025.
-
Deep Learning-Powered Classification of Thoracic Diseases in Chest X-Rays
Authors:
Yiming Lei,
Michael Nguyen,
Tzu Chia Liu,
Hyounkyun Oh
Abstract:
Chest X-rays play a pivotal role in diagnosing respiratory diseases such as pneumonia, tuberculosis, and COVID-19, which are prevalent and present unique diagnostic challenges due to overlapping visual features and variability in image quality. Severe class imbalance and the complexity of medical images hinder automated analysis. This study leverages deep learning techniques, including transfer le…
▽ More
Chest X-rays play a pivotal role in diagnosing respiratory diseases such as pneumonia, tuberculosis, and COVID-19, which are prevalent and present unique diagnostic challenges due to overlapping visual features and variability in image quality. Severe class imbalance and the complexity of medical images hinder automated analysis. This study leverages deep learning techniques, including transfer learning on pre-trained models (AlexNet, ResNet, and InceptionNet), to enhance disease detection and classification. By fine-tuning these models and incorporating focal loss to address class imbalance, significant performance improvements were achieved. Grad-CAM visualizations further enhance model interpretability, providing insights into clinically relevant regions influencing predictions. The InceptionV3 model, for instance, achieved a 28% improvement in AUC and a 15% increase in F1-Score. These findings highlight the potential of deep learning to improve diagnostic workflows and support clinical decision-making.
△ Less
Submitted 24 January, 2025;
originally announced January 2025.
-
Physics-informed DeepCT: Sinogram Wavelet Decomposition Meets Masked Diffusion
Authors:
Zekun Zhou,
Tan Liu,
Bing Yu,
Yanru Gong,
Liu Shi,
Qiegen Liu
Abstract:
Diffusion model shows remarkable potential on sparse-view computed tomography (SVCT) reconstruction. However, when a network is trained on a limited sample space, its generalization capability may be constrained, which degrades performance on unfamiliar data. For image generation tasks, this can lead to issues such as blurry details and inconsistencies between regions. To alleviate this problem, w…
▽ More
Diffusion model shows remarkable potential on sparse-view computed tomography (SVCT) reconstruction. However, when a network is trained on a limited sample space, its generalization capability may be constrained, which degrades performance on unfamiliar data. For image generation tasks, this can lead to issues such as blurry details and inconsistencies between regions. To alleviate this problem, we propose a Sinogram-based Wavelet random decomposition And Random mask diffusion Model (SWARM) for SVCT reconstruction. Specifically, introducing a random mask strategy in the sinogram effectively expands the limited training sample space. This enables the model to learn a broader range of data distributions, enhancing its understanding and generalization of data uncertainty. In addition, applying a random training strategy to the high-frequency components of the sinogram wavelet enhances feature representation and improves the ability to capture details in different frequency bands, thereby improving performance and robustness. Two-stage iterative reconstruction method is adopted to ensure the global consistency of the reconstructed image while refining its details. Experimental results demonstrate that SWARM outperforms competing approaches in both quantitative and qualitative performance across various datasets.
△ Less
Submitted 15 June, 2025; v1 submitted 16 January, 2025;
originally announced January 2025.
-
NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References
Authors:
Qiang Qu,
Yiran Shen,
Xiaoming Chen,
Yuk Ying Chung,
Weidong Cai,
Tongliang Liu
Abstract:
Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly…
▽ More
Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, a NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the "same instance, similar representation" assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
△ Less
Submitted 11 January, 2025;
originally announced January 2025.
-
ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification
Authors:
Yi Ma,
Shuai Wang,
Tianchi Liu,
Haizhou Li
Abstract:
In speaker verification, we use computational method to verify if an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable res…
▽ More
In speaker verification, we use computational method to verify if an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable results comparable to those from manual forensic voice comparison. A novel approach, Explainable Phonetic Trait-Oriented (ExPO) network, is proposed in this paper to introduce the speaker's phonetic trait which describes the speaker's characteristics at the phonetic level, resembling what forensic comparison does. ExPO not only generates utterance-level speaker embeddings but also allows for fine-grained analysis and visualization of phonetic traits, offering an explainable speaker verification process. Furthermore, we investigate phonetic traits from within-speaker and between-speaker variation perspectives to determine which trait is most effective for speaker verification, marking an important step towards explainable speaker verification. Our code is available at https://github.com/mmmmayi/ExPO.
△ Less
Submitted 14 January, 2025; v1 submitted 10 January, 2025;
originally announced January 2025.
-
JammingSnake: A follow-the-leader continuum robot with variable stiffness based on fiber jamming
Authors:
Chen Qian,
Tangyou Liu,
Liao Wu
Abstract:
Follow-the-leader (FTL) motion is essential for continuum robots operating in fragile and confined environments. It allows the robot to exert minimal force on its surroundings, reducing the risk of damage. This paper presents a novel design of a snake-like robot capable of achieving FTL motion by integrating fiber jamming modules (FJMs). The proposed robot can dynamically adjust its stiffness duri…
▽ More
Follow-the-leader (FTL) motion is essential for continuum robots operating in fragile and confined environments. It allows the robot to exert minimal force on its surroundings, reducing the risk of damage. This paper presents a novel design of a snake-like robot capable of achieving FTL motion by integrating fiber jamming modules (FJMs). The proposed robot can dynamically adjust its stiffness during propagation and interaction with the environment. An algorithm is developed to independently control the tendon and FJM insertion movements, allowing the robot to maintain its shape while minimizing the forces exerted on surrounding structures. To validate the proposed design, comparative tests were conducted between a traditional tendon-driven robot and the novel design under different configurations. The results demonstrate that our design relies significantly less on contact with the surroundings to maintain its shape. This highlights its potential for safer and more effective operations in delicate environments, such as minimally invasive surgery (MIS) or industrial in-situ inspection.
△ Less
Submitted 19 June, 2025; v1 submitted 4 January, 2025;
originally announced January 2025.
-
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Authors:
Liang Chen,
Zekun Wang,
Shuhuai Ren,
Lei Li,
Haozhe Zhao,
Yunshui Li,
Zefan Cai,
Hongcheng Guo,
Lei Zhang,
Yizhe Xiong,
Yichi Zhang,
Ruoyu Wu,
Qingxiu Dong,
Ge Zhang,
Jian Yang,
Lingwei Meng,
Shujie Hu,
Yulong Chen,
Junyang Lin,
Shuai Bai,
Andreas Vlachos,
Xu Tan,
Minjia Zhang,
Wen Xiao,
Aaron Yee
, et al. (2 additional authors not shown)
Abstract:
Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks f…
▽ More
Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming the multimodal information into tokens and predict the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: Multimodal tokenization, MMNTP model architectures, unified task representation, datasets \& evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction
△ Less
Submitted 29 December, 2024; v1 submitted 16 December, 2024;
originally announced December 2024.
-
Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling
Authors:
Shao-Syuan Huang,
Kuan-Po Huang,
Andy T. Liu,
Hung-yi Lee
Abstract:
Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles…
▽ More
Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum embedding to more closely approximate the true embedding for unseen languages. Experimental results demonstrate substantial improvements in ASR performance, both in zero-shot and fine-tuning settings. Our proposed methods outperform baseline approaches, providing an effective solution for addressing unseen languages in multilingual ASR.
△ Less
Submitted 20 December, 2024;
originally announced December 2024.
-
A Real-Time System for Scheduling and Managing UAV Delivery in Urban
Authors:
Han Liu,
Tian Liu,
Kai Huang
Abstract:
As urban logistics demand continues to grow, UAV delivery has become a key solution to improve delivery efficiency, reduce traffic congestion, and lower logistics costs. However, to fully leverage the potential of UAV delivery networks, efficient swarm scheduling and management are crucial. In this paper, we propose a real-time scheduling and management system based on the ``Airport-Unloading Stat…
▽ More
As urban logistics demand continues to grow, UAV delivery has become a key solution to improve delivery efficiency, reduce traffic congestion, and lower logistics costs. However, to fully leverage the potential of UAV delivery networks, efficient swarm scheduling and management are crucial. In this paper, we propose a real-time scheduling and management system based on the ``Airport-Unloading Station" model, aiming to bridge the gap between high-level scheduling algorithms and low-level execution systems. This system, acting as middleware, accurately translates the requirements from the scheduling layer into specific execution instructions, ensuring that the scheduling algorithms perform effectively in real-world environments. Additionally, we implement three collaborative scheduling schemes involving autonomous ground vehicles (AGVs), unmanned aerial vehicles (UAVs), and ground staff to further optimize overall delivery efficiency. Through extensive experiments, this study demonstrates the rationality and feasibility of the proposed management system, providing practical solution for the commercial application of UAVs delivery in urban.
Code: https://github.com/chengji253/UAVDeliverySystem
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond
Authors:
Muhammad Huzaifah,
Geyu Lin,
Tianchi Liu,
Hardik B. Sailor,
Kye Min Tan,
Tarun K. Vangani,
Qiongqiong Wang,
Jeremy H. M. Wong,
Nancy F. Chen,
Ai Ti Aw
Abstract:
This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports main…
▽ More
This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON-SpeechEncoder was pre-trained from scratch on 200,000 hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements to spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive to other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.
△ Less
Submitted 20 December, 2024; v1 submitted 16 December, 2024;
originally announced December 2024.
-
An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds
Authors:
TianZhu Liu,
BangYan Hu,
YanFeng Gu,
Xian Li,
Aleksandra Pižurica
Abstract:
Multispectral point cloud (MPC) captures 3D spatial-spectral information from the observed scene, which can be used for scene understanding and has a wide range of applications. However, most of the existing classification methods were extensively tested on indoor datasets, and when applied to outdoor datasets they still face problems including sparse labeled targets, differences in land-covers sc…
▽ More
Multispectral point cloud (MPC) captures 3D spatial-spectral information from the observed scene, which can be used for scene understanding and has a wide range of applications. However, most of the existing classification methods were extensively tested on indoor datasets, and when applied to outdoor datasets they still face problems including sparse labeled targets, differences in land-covers scales, and long-tailed distributions. To address the above issues, an enhanced classification method based on adaptive multi-scale fusion for MPCs with long-tailed distributions is proposed. In the training set generation stage, a grid-balanced sampling strategy is designed to reliably generate training samples from sparse labeled datasets. In the feature learning stage, a multi-scale feature fusion module is proposed to fuse shallow features of land-covers at different scales, addressing the issue of losing fine features due to scale variations in land-covers. In the classification stage, an adaptive hybrid loss module is devised to utilize multi-classification heads with adaptive weights to balance the learning ability of different classes, improving the classification performance of small classes due to various-scales and long-tailed distributions in land-covers. Experimental results on three MPC datasets demonstrate the effectiveness of the proposed method compared with the state-of-the-art methods.
△ Less
Submitted 15 December, 2024;
originally announced December 2024.
-
High-Speed Cornering Control and Real-Vehicle Deployment for Autonomous Electric Vehicles
Authors:
Shiyue Zhao,
Junzhi Zhang,
Neda Masoud,
Yuhong Jiang,
Heye Huang,
Tao Liu
Abstract:
Executing drift maneuvers during high-speed cornering presents significant challenges for autonomous vehicles, yet offers the potential to minimize turning time and enhance driving dynamics. While reinforcement learning (RL) has shown promising results in simulated environments, discrepancies between simulations and real-world conditions have limited its practical deployment. This study introduces…
▽ More
Executing drift maneuvers during high-speed cornering presents significant challenges for autonomous vehicles, yet offers the potential to minimize turning time and enhance driving dynamics. While reinforcement learning (RL) has shown promising results in simulated environments, discrepancies between simulations and real-world conditions have limited its practical deployment. This study introduces an innovative control framework that integrates trajectory optimization with drift maneuvers, aiming to improve the algorithm's adaptability for real-vehicle implementation. We leveraged Bezier-based pre-trajectory optimization to enhance rewards and optimize the controller through Twin Delayed Deep Deterministic Policy Gradient (TD3) in a simulated environment. For real-world deployment, we implement a hybrid RL-MPC fusion mechanism, , where TD3-derived maneuvers serve as primary inputs for a Model Predictive Controller (MPC). This integration enables precise real-time tracking of the optimal trajectory, with MPC providing corrective inputs to bridge the gap between simulation and reality. The efficacy of this method is validated through real-vehicle tests on consumer-grade electric vehicles, focusing on drift U-turns and drift right-angle turns. The control outcomes of these real-vehicle tests are thoroughly documented in the paper, supported by supplementary video evidence (https://youtu.be/5wp67FcpfL8). Notably, this study is the first to deploy and apply an RL-based transient drift cornering algorithm on consumer-grade electric vehicles.
△ Less
Submitted 21 November, 2024; v1 submitted 18 November, 2024;
originally announced November 2024.
-
MaDiNet: Mamba Diffusion Network for SAR Target Detection
Authors:
Jie Zhou,
Chao Xiao,
Bowen Peng,
Tianpeng Liu,
Zhen Liu,
Yongxiang Liu,
Li Liu
Abstract:
The fundamental challenge in SAR target detection lies in developing discriminative, efficient, and robust representations of target characteristics within intricate non-cooperative environments. However, accurate target detection is impeded by factors including the sparse distribution and discrete features of the targets, as well as complex background interference. In this study, we propose a \te…
▽ More
The fundamental challenge in SAR target detection lies in developing discriminative, efficient, and robust representations of target characteristics within intricate non-cooperative environments. However, accurate target detection is impeded by factors including the sparse distribution and discrete features of the targets, as well as complex background interference. In this study, we propose a \textbf{Ma}mba \textbf{Di}ffusion \textbf{Net}work (MaDiNet) for SAR target detection. Specifically, MaDiNet conceptualizes SAR target detection as the task of generating the position (center coordinates) and size (width and height) of the bounding boxes in the image space. Furthermore, we design a MambaSAR module to capture intricate spatial structural information of targets and enhance the capability of the model to differentiate between targets and complex backgrounds. The experimental results on extensive SAR target detection datasets achieve SOTA, proving the effectiveness of the proposed network. Code is available at \href{https://github.com/JoyeZLearning/MaDiNet}{https://github.com/JoyeZLearning/MaDiNet}.
△ Less
Submitted 11 November, 2024;
originally announced November 2024.
-
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Authors:
Chien-yu Huang,
Wei-Chih Chen,
Shu-wen Yang,
Andy T. Liu,
Chen-An Li,
Yu-Xiang Lin,
Wei-Cheng Tseng,
Anuj Diwan,
Yi-Jen Shih,
Jiatong Shi,
William Chen,
Chih-Kai Yang,
Wenze Ren,
Xuanjun Chen,
Chi-Yuan Hsiao,
Puyuan Peng,
Shih-Heng Wang,
Chun-Yi Kuan,
Ke-Han Lu,
Kai-Wei Chang,
Fabian Ritter-Gutierrez,
Kuan-Po Huang,
Siddhant Arora,
You-Kuan Lin,
Ming To Chuang
, et al. (55 additional authors not shown)
Abstract:
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati…
▽ More
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results show that no model performed well universally. SALMONN-13B excelled in English ASR and Qwen2-Audio-7B-Instruct showed high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We open-source all task data and the evaluation pipeline at https://github.com/dynamic-superb/dynamic-superb.
△ Less
Submitted 9 June, 2025; v1 submitted 8 November, 2024;
originally announced November 2024.
-
PGDiffSeg: Prior-Guided Denoising Diffusion Model with Parameter-Shared Attention for Breast Cancer Segmentation
Authors:
Feiyan Feng,
Tianyu Liu,
Hong Wang,
Jun Zhao,
Wei Li,
Yanshen Sun
Abstract:
Early detection through imaging and accurate diagnosis is crucial in mitigating the high mortality rate associated with breast cancer. However, locating tumors from low-resolution and high-noise medical images is extremely challenging. Therefore, this paper proposes a novel PGDiffSeg (Prior-Guided Diffusion Denoising Model with Parameter-Shared Attention) that applies diffusion denoising methods t…
▽ More
Early detection through imaging and accurate diagnosis is crucial in mitigating the high mortality rate associated with breast cancer. However, locating tumors from low-resolution and high-noise medical images is extremely challenging. Therefore, this paper proposes a novel PGDiffSeg (Prior-Guided Diffusion Denoising Model with Parameter-Shared Attention) that applies diffusion denoising methods to breast cancer medical image segmentation, accurately recovering the affected areas from Gaussian noise. Firstly, we design a parallel pipeline for noise processing and semantic information processing and propose a parameter-shared attention module (PSA) in multi-layer that seamlessly integrates these two pipelines. This integration empowers PGDiffSeg to incorporate semantic details at multiple levels during the denoising process, producing highly accurate segmentation maps. Secondly, we introduce a guided strategy that leverages prior knowledge to simulate the decision-making process of medical professionals, thereby enhancing the model's ability to locate tumor positions precisely. Finally, we provide the first-ever discussion on the interpretability of the generative diffusion model in the context of breast cancer segmentation. Extensive experiments have demonstrated the superiority of our model over the current state-of-the-art approaches, confirming its effectiveness as a flexible diffusion denoising method suitable for medical image research. Our code will be publicly available later.
△ Less
Submitted 23 October, 2024;
originally announced October 2024.
-
EG-SpikeFormer: Eye-Gaze Guided Transformer on Spiking Neural Networks for Medical Image Analysis
Authors:
Yi Pan,
Hanqi Jiang,
Junhao Chen,
Yiwei Li,
Huaqin Zhao,
Yifan Zhou,
Peng Shu,
Zihao Wu,
Zhengliang Liu,
Dajiang Zhu,
Xiang Li,
Yohannes Abate,
Tianming Liu
Abstract:
Neuromorphic computing has emerged as a promising energy-efficient alternative to traditional artificial intelligence, predominantly utilizing spiking neural networks (SNNs) implemented on neuromorphic hardware. Significant advancements have been made in SNN-based convolutional neural networks (CNNs) and Transformer architectures. However, neuromorphic computing for the medical imaging domain rema…
▽ More
Neuromorphic computing has emerged as a promising energy-efficient alternative to traditional artificial intelligence, predominantly utilizing spiking neural networks (SNNs) implemented on neuromorphic hardware. Significant advancements have been made in SNN-based convolutional neural networks (CNNs) and Transformer architectures. However, neuromorphic computing for the medical imaging domain remains underexplored. In this study, we introduce EG-SpikeFormer, an SNN architecture tailored for clinical tasks that incorporates eye-gaze data to guide the model's attention to the diagnostically relevant regions in medical images. Our developed approach effectively addresses shortcut learning issues commonly observed in conventional models, especially in scenarios with limited clinical data and high demands for model reliability, generalizability, and transparency. Our EG-SpikeFormer not only demonstrates superior energy efficiency and performance in medical image prediction tasks but also enhances clinical relevance through multi-modal information alignment. By incorporating eye-gaze data, the model improves interpretability and generalization, opening new directions for applying neuromorphic computing in healthcare.
△ Less
Submitted 29 October, 2024; v1 submitted 12 October, 2024;
originally announced October 2024.
-
Design and Characterization of High Efficiency Single-stage Electromagnetic Coil Guns
Authors:
Sophia Chen,
Annie Peng,
Ava Chen,
Takyiu Liu
Abstract:
This study presents several novel approaches to improve the efficiency of a single-stage coil gun. Conventional designs typically feature a uniformly wound solenoid and a ferrite projectile. For our research, we constructed a microcontroller-based prototype to test several new enhancements, including the use of a bipolar current pulse, a stepped multilayer coil with non-uniform winding densities,…
▽ More
This study presents several novel approaches to improve the efficiency of a single-stage coil gun. Conventional designs typically feature a uniformly wound solenoid and a ferrite projectile. For our research, we constructed a microcontroller-based prototype to test several new enhancements, including the use of a bipolar current pulse, a stepped multilayer coil with non-uniform winding densities, and the replacement of conventional ferrite projectiles with a neodymium permanent magnet. These modifications were designed to reduce energy loss and improve projectile acceleration by changing magnetic field strength and effectively controlling the magnetic flux. The experimental results show that the proposed methods resulted in significant efficiency improvements, with the varying current pulse and stepped coil design providing enhanced magnetic force at key points in the projectile's path, and the permanent magnet projectile contributing to higher velocities and efficiencies by leveraging the current pulses. Our findings suggest that combining these enhancements significantly improves coil gun performance, achieving higher velocities and efficiencies. These findings can be applied to future coil gun developments, such as multi-stage coil gun systems.
△ Less
Submitted 10 October, 2024;
originally announced October 2024.
-
Epsilon-VAE: Denoising as Visual Decoding
Authors:
Long Zhao,
Sanghyun Woo,
Ziyu Wan,
Yandong Li,
Han Zhang,
Boqing Gong,
Hartwig Adam,
Xuhui Jia,
Ting Liu
Abstract:
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representatio…
▽ More
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality, which in turn enhances downstream generation quality by 22% at the same compression rates or provides 2.3x inference speedup through increasing compression rates. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.
△ Less
Submitted 28 May, 2025; v1 submitted 5 October, 2024;
originally announced October 2024.
-
ECHOPulse: ECG controlled echocardio-grams video generation
Authors:
Yiwei Li,
Sekeun Kim,
Zihao Wu,
Hanqi Jiang,
Yi Pan,
Pengfei Jin,
Sifan Song,
Yucheng Shi,
Tianming Liu,
Quanzheng Li,
Xiang Li
Abstract:
Echocardiography (ECHO) is essential for cardiac assessments, but its video quality and interpretation heavily relies on manual expertise, leading to inconsistent results from clinical and portable devices. ECHO video generation offers a solution by improving automated monitoring through synthetic data and generating high-quality videos from routine health data. However, existing models often face…
▽ More
Echocardiography (ECHO) is essential for cardiac assessments, but its video quality and interpretation heavily relies on manual expertise, leading to inconsistent results from clinical and portable devices. ECHO video generation offers a solution by improving automated monitoring through synthetic data and generating high-quality videos from routine health data. However, existing models often face high computational costs, slow inference, and rely on complex conditional prompts that require experts' annotations. To address these challenges, we propose ECHOPULSE, an ECG-conditioned ECHO video generation model. ECHOPULSE introduces two key advancements: (1) it accelerates ECHO video generation by leveraging VQ-VAE tokenization and masked visual token modeling for fast decoding, and (2) it conditions on readily accessible ECG signals, which are highly coherent with ECHO videos, bypassing complex conditional prompts. To the best of our knowledge, this is the first work to use time-series prompts like ECG signals for ECHO video generation. ECHOPULSE not only enables controllable synthetic ECHO data generation but also provides updated cardiac function information for disease monitoring and prediction beyond ECG alone. Evaluations on three public and private datasets demonstrate state-of-the-art performance in ECHO video generation across both qualitative and quantitative measures. Additionally, ECHOPULSE can be easily generalized to other modality generation tasks, such as cardiac MRI, fMRI, and 3D CT generation. Demo can seen from \url{https://github.com/levyisthebest/ECHOPulse_Prelease}.
△ Less
Submitted 11 October, 2024; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Mind the Gap: Promoting Missing Modality Brain Tumor Segmentation with Alignment
Authors:
Tianyi Liu,
Zhaorui Tan,
Haochuan Jiang,
Xi Yang,
Kaizhu Huang
Abstract:
Brain tumor segmentation is often based on multiple magnetic resonance imaging (MRI). However, in clinical practice, certain modalities of MRI may be missing, which presents an even more difficult scenario. To cope with this challenge, knowledge distillation has emerged as one promising strategy. However, recent efforts typically overlook the modality gaps and thus fail to learn invariant feature…
▽ More
Brain tumor segmentation is often based on multiple magnetic resonance imaging (MRI). However, in clinical practice, certain modalities of MRI may be missing, which presents an even more difficult scenario. To cope with this challenge, knowledge distillation has emerged as one promising strategy. However, recent efforts typically overlook the modality gaps and thus fail to learn invariant feature representations across different modalities. Such drawback consequently leads to limited performance for both teachers and students. To ameliorate these problems, in this paper, we propose a novel paradigm that aligns latent features of involved modalities to a well-defined distribution anchor. As a major contribution, we prove that our novel training paradigm ensures a tight evidence lower bound, thus theoretically certifying its effectiveness. Extensive experiments on different backbones validate that the proposed paradigm can enable invariant feature representations and produce a teacher with narrowed modality gaps. This further offers superior guidance for missing modality students, achieving an average improvement of 1.75 on dice score.
△ Less
Submitted 28 September, 2024;
originally announced September 2024.
-
Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget
Authors:
Andy T. Liu,
Yi-Cheng Lin,
Haibin Wu,
Stefan Winkler,
Hung-yi Lee
Abstract:
Despite their impressive success, training foundation models remains computationally costly. This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget. We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size. Our goal is to make analytical steps toward under…
▽ More
Despite their impressive success, training foundation models remains computationally costly. This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget. We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size. Our goal is to make analytical steps toward understanding the training dynamics of speech foundation models. We benchmark SSL objectives in an entirely comparable setting and find that other factors contribute more significantly to the success of SSL. Our results show that slimmer model architectures outperform common small architectures under the same compute and parameter budget. We demonstrate that the size of the pre-training data remains crucial, even with data augmentation during SSL training, as performance suffers when iterating over limited data. Finally, we identify a trade-off between model size and data size, highlighting an optimal model size for a given compute budget.
△ Less
Submitted 4 February, 2025; v1 submitted 9 September, 2024;
originally announced September 2024.
-
ERIC: Estimating Rainfall with Commodity Doorbell Camera for Precision Residential Irrigation
Authors:
Tian Liu,
Liuyi Jin,
Radu Stoleru,
Amran Haroon,
Charles Swanson,
Kexin Feng
Abstract:
Current state-of-the-art residential irrigation systems, such as WaterMyYard, rely on rainfall data from nearby weather stations to adjust irrigation amounts. However, the accuracy of rainfall data is compromised by the limited spatial resolution of rain gauges and the significant variability of hyperlocal rainfall, leading to substantial water waste. To improve irrigation efficiency, we developed…
▽ More
Current state-of-the-art residential irrigation systems, such as WaterMyYard, rely on rainfall data from nearby weather stations to adjust irrigation amounts. However, the accuracy of rainfall data is compromised by the limited spatial resolution of rain gauges and the significant variability of hyperlocal rainfall, leading to substantial water waste. To improve irrigation efficiency, we developed a cost-effective irrigation system, dubbed ERIC, which employs machine learning models to estimate rainfall from commodity doorbell camera footage and optimizes irrigation schedules without human intervention. Specifically, we: a) designed novel visual and audio features with lightweight neural network models to infer rainfall from the camera at the edge, preserving user privacy; b) built a complete end-to-end irrigation system on Raspberry Pi 4, costing only \$75. We deployed the system across five locations (collecting over 750 hours of video) with varying backgrounds and light conditions. Comprehensive evaluation validates that ERIC achieves state-of-the-art rainfall estimation performance ($\sim$ 5mm/day), saving 9,112 gallons/month of water, translating to \$28.56/month in utility savings. Data and code are available at https://github.com/LENSS/ERIC-BuildSys2024.git
△ Less
Submitted 3 October, 2024; v1 submitted 19 September, 2024;
originally announced September 2024.
-
A Survey of Foundation Models for Music Understanding
Authors:
Wenjun Li,
Ying Cai,
Ziyang Wu,
Wenyi Zhang,
Yifan Chen,
Rundong Qi,
Mengqi Dong,
Peigen Chen,
Xiao Dong,
Fenghao Shi,
Lei Guo,
Junwei Han,
Bao Ge,
Tianming Liu,
Lin Gan,
Tuo Zhang
Abstract:
Music is essential in daily life, fulfilling emotional and entertainment needs, and connecting us personally, socially, and culturally. A better understanding of music can enhance our emotions, cognitive skills, and cultural connections. The rapid advancement of artificial intelligence (AI) has introduced new ways to analyze music, aiming to replicate human understanding of music and provide relat…
▽ More
Music is essential in daily life, fulfilling emotional and entertainment needs, and connecting us personally, socially, and culturally. A better understanding of music can enhance our emotions, cognitive skills, and cultural connections. The rapid advancement of artificial intelligence (AI) has introduced new ways to analyze music, aiming to replicate human understanding of music and provide related services. While the traditional models focused on audio features and simple tasks, the recent development of large language models (LLMs) and foundation models (FMs), which excel in various fields by integrating semantic information and demonstrating strong reasoning abilities, could capture complex musical features and patterns, integrate music with language and incorporate rich musical, emotional and psychological knowledge. Therefore, they have the potential in handling complex music understanding tasks from a semantic perspective, producing outputs closer to human perception. This work, to our best knowledge, is one of the early reviews of the intersection of AI techniques and music understanding. We investigated, analyzed, and tested recent large-scale music foundation models in respect of their music comprehension abilities. We also discussed their limitations and proposed possible future directions, offering insights for researchers in this field.
△ Less
Submitted 14 September, 2024;
originally announced September 2024.
-
Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing
Authors:
Tianchi Liu,
Ivan Kukanov,
Zihan Pan,
Qiongqiong Wang,
Hardik B. Sailor,
Kong Aik Lee
Abstract:
The effects of language mismatch impact speech anti-spoofing systems, while investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on Eng…
▽ More
The effects of language mismatch impact speech anti-spoofing systems, while investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach - Accent-based data expansion via TTS (ACCENT), which introduces diverse linguistic knowledge to monolingual-trained models, improving their cross-lingual capabilities. We conduct experiments on a large-scale dataset consisting of over 3 million samples, including 1.8 million training samples and nearly 1.2 million testing samples across 12 languages. The language mismatch effects are preliminarily quantified and remarkably reduced over 15% by applying the proposed ACCENT. This easily implementable method shows promise for multilingual and low-resource language scenarios.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024
Authors:
Anmol Guragain,
Tianchi Liu,
Zihan Pan,
Hardik B. Sailor,
Qiongqiong Wang
Abstract:
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVD…
▽ More
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore the ensemble methods, utilizing speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The codes can be accessed at https://github.com/Anmol2059/SVDD2024.
△ Less
Submitted 3 September, 2024;
originally announced September 2024.
-
SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors
Authors:
Zhiqing Zhang,
Tianyong Liu,
Guojia Fan,
Bin Li,
Qianjin Feng,
Shoujun Zhou
Abstract:
Accurate segmentation of 3D clinical medical images is critical in the diagnosis and treatment of spinal diseases. However, the inherent complexity of spinal anatomy and uncertainty inherent in current imaging technologies, poses significant challenges for semantic segmentation of spinal images. Although convolutional neural networks (CNNs) and Transformer-based models have made some progress in s…
▽ More
Accurate segmentation of 3D clinical medical images is critical in the diagnosis and treatment of spinal diseases. However, the inherent complexity of spinal anatomy and uncertainty inherent in current imaging technologies, poses significant challenges for semantic segmentation of spinal images. Although convolutional neural networks (CNNs) and Transformer-based models have made some progress in spinal segmentation, their limitations in handling long-range dependencies hinder further improvements in segmentation accuracy.To address these challenges, we introduce a residual visual Mamba layer to effectively capture and model the deep semantic features and long-range spatial dependencies of 3D spinal data. To further enhance the structural semantic understanding of the vertebrae, we also propose a novel spinal shape prior module that captures specific anatomical information of the spine from medical images, significantly enhancing the model's ability to extract structural semantic information of the vertebrae. Comparative and ablation experiments on two datasets demonstrate that SpineMamba outperforms existing state-of-the-art models. On the CT dataset, the average Dice similarity coefficient for segmentation reaches as high as 94.40, while on the MR dataset, it reaches 86.95. Notably, compared to the renowned nnU-Net, SpineMamba achieves superior segmentation performance, exceeding it by up to 2 percentage points. This underscores its accuracy, robustness, and excellent generalization capabilities.
△ Less
Submitted 28 August, 2024;
originally announced August 2024.
-
Multi-Channel Masked Autoencoder and Comprehensive Evaluations for Reconstructing 12-Lead ECG from Arbitrary Single-Lead ECG
Authors:
Jiarong Chen,
Wanqing Wu,
Tong Liu,
Shenda Hong
Abstract:
Electrocardiogram (ECG) has emerged as a widely accepted diagnostic instrument for cardiovascular diseases (CVD). The standard clinical 12-lead ECG configuration causes considerable inconvenience and discomfort, while wearable devices offers a more practical alternative. To reduce information gap between 12-lead ECG and single-lead ECG, this study proposes a multi-channel masked autoencoder (MCMA)…
▽ More
Electrocardiogram (ECG) has emerged as a widely accepted diagnostic instrument for cardiovascular diseases (CVD). The standard clinical 12-lead ECG configuration causes considerable inconvenience and discomfort, while wearable devices offers a more practical alternative. To reduce information gap between 12-lead ECG and single-lead ECG, this study proposes a multi-channel masked autoencoder (MCMA) for reconstructing 12-Lead ECG from arbitrary single-lead ECG, and a comprehensive evaluation benchmark, ECGGenEval, encompass the signal-level, feature-level, and diagnostic-level evaluations. MCMA can achieve the state-of-the-art performance. In the signal-level evaluation, the mean square errors of 0.0317 and 0.1034, Pearson correlation coefficients of 0.7885 and 0.7420. In the feature-level evaluation, the average standard deviation of the mean heart rate across the generated 12-lead ECG is 1.0481, the coefficient of variation is 1.58%, and the range is 3.2874. In the diagnostic-level evaluation, the average F1-score with two generated 12-lead ECG from different single-lead ECG are 0.8233 and 0.8410.
△ Less
Submitted 3 October, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports
Authors:
Yutong Zhang,
Yi Pan,
Tianyang Zhong,
Peixin Dong,
Kangni Xie,
Yuxiao Liu,
Hanqi Jiang,
Zhengliang Liu,
Shijie Zhao,
Tuo Zhang,
Xi Jiang,
Dinggang Shen,
Tianming Liu,
Xin Zhang
Abstract:
Medical images and radiology reports are crucial for diagnosing medical conditions, highlighting the importance of quantitative analysis for clinical decision-making. However, the diversity and cross-source heterogeneity of these data challenge the generalizability of current data-mining methods. Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecti…
▽ More
Medical images and radiology reports are crucial for diagnosing medical conditions, highlighting the importance of quantitative analysis for clinical decision-making. However, the diversity and cross-source heterogeneity of these data challenge the generalizability of current data-mining methods. Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence (AGI) for computer vision, showcasing their potential in the biomedical domain. In this study, we evaluated the performance of the Gemini, GPT-4, and 4 popular large models for an exhaustive evaluation across 14 medical imaging datasets, including 5 medical imaging categories (dermatology, radiology, dentistry, ophthalmology, and endoscopy), and 3 radiology report datasets. The investigated tasks encompass disease classification, lesion segmentation, anatomical localization, disease diagnosis, report generation, and lesion detection. Our experimental results demonstrated that Gemini-series models excelled in report generation and lesion detection but faces challenges in disease classification and anatomical localization. Conversely, GPT-series models exhibited proficiency in lesion segmentation and anatomical localization but encountered difficulties in disease diagnosis and lesion detection. Additionally, both the Gemini series and GPT series contain models that have demonstrated commendable generation efficiency. While both models hold promise in reducing physician workload, alleviating pressure on limited healthcare resources, and fostering collaboration between clinical practitioners and artificial intelligence technologies, substantial enhancements and comprehensive validations remain imperative before clinical deployment.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Gridless Parameter Estimation in Partly Calibrated Rectangular Arrays
Authors:
Tianyi Liu,
Sai Pavan Deram,
Khaled Ardah,
Martin Haardt,
Marc E. Pfetsch,
Marius Pesavento
Abstract:
Spatial frequency estimation from a mixture of noisy sinusoids finds applications in various fields. While subspace-based methods offer cost-effective super-resolution parameter estimation, they demand precise array calibration, posing challenges for large antennas. In contrast, sparsity-based approaches outperform subspace methods, especially in scenarios with limited snapshots or correlated sour…
▽ More
Spatial frequency estimation from a mixture of noisy sinusoids finds applications in various fields. While subspace-based methods offer cost-effective super-resolution parameter estimation, they demand precise array calibration, posing challenges for large antennas. In contrast, sparsity-based approaches outperform subspace methods, especially in scenarios with limited snapshots or correlated sources. This study focuses on direction-of-arrival (DOA) estimation using a partly calibrated rectangular array with fully calibrated subarrays. A gridless sparse formulation leveraging shift invariances in the array is developed, yielding two competitive algorithms under the alternating direction method of multipliers (ADMM) and successive convex approximation frameworks, respectively. Numerical simulations show the superior error performance of our proposed method, particularly in highly correlated scenarios, compared to the conventional subspace-based methods. It is demonstrated that the proposed formulation can also be adopted in the fully calibrated case to improve the robustness of the subspace-based methods to the source correlation. Furthermore, we provide a generalization of the proposed method to a more challenging case where a part of the sensors is unobservable due to failures.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Capture Point Control in Thruster-Assisted Bipedal Locomotion
Authors:
Shreyansh Pitroda,
Aditya Bondada,
Kaushik Venkatesh Krishnamurthy,
Adarsh Salagame,
Chenghao Wang,
Taoran Liu,
Bibek Gupta,
Eric Sihite,
Reza Nemovi,
Alireza Ramezani,
Morteza Gharib
Abstract:
Despite major advancements in control design that are robust to unplanned disturbances, bipedal robots are still susceptible to falling over and struggle to negotiate rough terrains. By utilizing thrusters in our bipedal robot, we can perform additional posture manipulation and expand the modes of locomotion to enhance the robot's stability and ability to negotiate rough and difficult-to-navigate…
▽ More
Despite major advancements in control design that are robust to unplanned disturbances, bipedal robots are still susceptible to falling over and struggle to negotiate rough terrains. By utilizing thrusters in our bipedal robot, we can perform additional posture manipulation and expand the modes of locomotion to enhance the robot's stability and ability to negotiate rough and difficult-to-navigate terrains. In this paper, we present our efforts in designing a controller based on capture point control for our thruster-assisted walking model named Harpy and explore its control design possibilities. While capture point control based on centroidal models for bipedal systems has been extensively studied, the incorporation of external forces that can influence the dynamics of linear inverted pendulum models, often used in capture point-based works, has not been explored before. The inclusion of these external forces can lead to interesting interpretations of locomotion, such as virtual buoyancy studied in aquatic-legged locomotion. This paper outlines the dynamical model of our robot, the capture point method we use to assist the upper body stabilization, and the simulation work done to show the controller's feasibility.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
A tensor model for the calibration of air-coupled ultrasonic sensor arrays in 3D imaging
Authors:
Raphael Müller,
Gianni Allevato,
Matthias Rutsch,
Christoph Haugwitz,
Tianyi Liu,
Mario Kupnik,
Marius Pesavento
Abstract:
Arrays of ultrasonic sensors are capable of 3D imaging in air and an affordable supplement to other sensing modalities, such as radar, lidar, and camera, i.e. in heterogeneous sensing systems. However, manufacturing tolerances of air-coupled ultrasonic sensors may lead to amplitude and phase deviations. Together with artifacts from imperfect knowledge of the array geometry, there are numerous fact…
▽ More
Arrays of ultrasonic sensors are capable of 3D imaging in air and an affordable supplement to other sensing modalities, such as radar, lidar, and camera, i.e. in heterogeneous sensing systems. However, manufacturing tolerances of air-coupled ultrasonic sensors may lead to amplitude and phase deviations. Together with artifacts from imperfect knowledge of the array geometry, there are numerous factors that can impair the imaging performance of an array. We propose a reference-based calibration method to overcome possible limitations. First, we introduce a novel tensor signal model to capture the characteristics of piezoelectric ultrasonic transducers (PUTs) and the underlying multidimensional nature of a multiple-input multiple-output (MIMO) sensor array. Second, we formulate and solve an optimization problem based on this model to obtain the calibrated parameters of the array. Third, we assess both our model and the commonly used analytic model using real data from a 3D imaging experiment. The experiment reveals that our array response model we learned with calibration data yields an imaging performance similar to that of the analytic array model, which requires perfect array geometry information.
△ Less
Submitted 18 December, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.