-
An Adaptive Estimation Approach based on Fisher Information to Overcome the Challenges of LFP Battery SOC Estimation
Authors:
Junzhe Shi,
Shida Jiang,
Shengyu Tao,
Jaewong Lee,
Manashita Borah,
Scott Moura
Abstract:
Robust, real-time State of Charge (SOC) estimation is essential for Lithium Iron Phosphate (LFP) batteries, which are widely used in electric vehicles (EVs) and energy storage systems due to their safety and longevity. However, the flat Open Circuit Voltage (OCV)-SOC curve makes this task particularly challenging. The challenge is compounded by hysteresis effects and by real-world conditions such as current bias, voltage quantization errors, and temperature, which must be considered in battery management system use. In this paper, we propose an adaptive estimation approach to overcome the challenges of LFP SOC estimation. Specifically, the method uses an adaptive Fisher information fusion strategy that combines the SOC estimates from two different models: Coulomb counting and equivalent circuit model-based parameter identification. The effectiveness of this strategy is rationalized by the information richness excited by external cycling signals. A 3D OCV-H-SOC map that captures the relationship between OCV, hysteresis, and SOC is proposed as the backbone and can be generalized to other widely adopted parameter-identification methods. Extensive validation under ideal and real-world use scenarios, including SOC-OCV flat zones, current bias, voltage quantization errors, low temperatures, and insufficient current excitations, has been performed using four driving profiles (the Orange County Transit Bus Cycle, the California Unified Cycle, the US06 Drive Cycle, and the New York City Cycle). The results demonstrate superiority over the state-of-the-art unscented Kalman filter, long short-term memory networks, and transformer in all validation cases.
Submitted 1 July, 2025;
originally announced July 2025.
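The fusion idea in the abstract above can be illustrated with a generic information-weighted average. This is only a minimal sketch, not the authors' method: it assumes each SOC estimate comes with a scalar Fisher information value (for a Gaussian estimate, the inverse of its variance), and the function name `fuse_estimates` and its parameters are hypothetical.

```python
def fuse_estimates(soc_a, info_a, soc_b, info_b):
    """Fuse two scalar SOC estimates, weighting each by its information.

    Assumption (not from the abstract): info_a and info_b are scalar
    Fisher information values, i.e. inverse variances of the estimates.
    """
    total_info = info_a + info_b
    if total_info <= 0:
        raise ValueError("total information must be positive")
    # The estimate carrying more information gets the larger weight.
    fused_soc = (info_a * soc_a + info_b * soc_b) / total_info
    # For independent Gaussian estimates, information adds up.
    fused_variance = 1.0 / total_info
    return fused_soc, fused_variance
```

With hypothetical inputs `fuse_estimates(0.50, 100.0, 0.60, 300.0)`, the fused SOC is 0.575, which sits closer to the second estimate because that estimate carries three times the information. In a flat OCV-SOC region the model-based estimate would presumably carry little information, so a scheme like this would lean on Coulomb counting there.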
-
The RSNA Lumbar Degenerative Imaging Spine Classification (LumbarDISC) Dataset
Authors:
Tyler J. Richards,
Adam E. Flanders,
Errol Colak,
Luciano M. Prevedello,
Robyn L. Ball,
Felipe Kitamura,
John Mongan,
Maryam Vazirabad,
Hui-Ming Lin,
Anne Kendell,
Thanat Kanthawang,
Salita Angkurawaranon,
Emre Altinmakas,
Hakan Dogan,
Paulo Eduardo de Aguiar Kuriki,
Arjuna Somasundaram,
Christopher Ruston,
Deniz Bulja,
Naida Spahovic,
Jennifer Sommer,
Sirui Jiang,
Eduardo Moreno Judice de Mattos Farina,
Eduardo Caminha Nunes,
Michael Brassil,
Megan McNamara,
et al. (11 additional authors not shown)
Abstract:
The Radiological Society of North America (RSNA) Lumbar Degenerative Imaging Spine Classification (LumbarDISC) dataset is the largest publicly available dataset of adult MRI lumbar spine examinations annotated for degenerative changes. The dataset includes 2,697 patients with a total of 8,593 image series from 8 institutions across 6 countries and 5 continents. The dataset is available for free for non-commercial use via Kaggle and RSNA Medical Imaging Resource of AI (MIRA). The dataset was created for the RSNA 2024 Lumbar Spine Degenerative Classification competition where competitors developed deep learning models to grade degenerative changes in the lumbar spine. The degree of spinal canal, subarticular recess, and neural foraminal stenosis was graded at each intervertebral disc level in the lumbar spine. The images were annotated by expert volunteer neuroradiologists and musculoskeletal radiologists from the RSNA, American Society of Neuroradiology, and the American Society of Spine Radiology. This dataset aims to facilitate research and development in machine learning and lumbar spine imaging to lead to improved patient care and clinical efficiency.
Submitted 10 June, 2025;
originally announced June 2025.
-
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
Authors:
Ailin Huang,
Bingxin Li,
Bruce Wang,
Boyong Wu,
Chao Yan,
Chengli Feng,
Heng Wang,
Hongyu Zhou,
Hongyuan Wang,
Jingbei Li,
Jianjian Sun,
Joanna Wang,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Shilei Jiang,
Tian Fei,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Ge,
Zheng Gong,
Zhewei Huang,
et al. (51 additional authors not shown)
Abstract:
Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of a token-based vocoder in enhancing overall performance on AQAA tasks.
Submitted 13 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution
Authors:
Marcos V. Conde,
Radu Timofte,
Zihao Lu,
Xiangyu Kong,
Xiaoxia Xing,
Fan Wang,
Suejin Han,
MinKyu Park,
Tianyu Zhang,
Xin Luo,
Yeda Chen,
Dong Liu,
Li Pang,
Yuhang Yang,
Hongzhong Wang,
Xiangyong Cao,
Ruixuan Jiang,
Senyan Xu,
Siyuan Jiang,
Xueyang Fu,
Zheng-Jun Zha,
Tianyu Hao,
Yuhong He,
Ruoqi Li,
Yueqi Yang,
et al. (14 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Restoration and Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is not as well explored as in the RGB domain. The goal of this challenge is twofold: (i) restore RAW images with blur and noise degradations, and (ii) upscale RAW Bayer images by 2x, considering unknown noise and blur. A total of 230 participants registered for the challenge, and 45 submitted results during the challenge period. This report presents the current state of the art in RAW Restoration.
Submitted 4 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
COM Adjustment Mechanism Control for Multi-Configuration Motion Stability of Unmanned Deformable Vehicle
Authors:
Jun Liu,
Hongxun Liu,
Cheng Zhang,
Jiandang Xing,
Shang Jiang,
Ping Jiang
Abstract:
An unmanned deformable vehicle is a wheel-legged robot transforming between two configurations: vehicular and humanoid states, with different motion modes and stability characteristics. To address motion stability in multiple configurations, a center-of-mass adjustment mechanism was designed. Further, a motion stability hierarchical control algorithm was proposed, and an electromechanical model based on a two-degree-of-freedom center-of-mass adjustment mechanism was established. An unmanned-deformable-vehicle vehicular-state steady-state steering dynamics model and a gait planning kinematic model of humanoid state walking were established. A stability hierarchical control strategy was designed to realize the stability control. The results showed that the steady-state steering stability in vehicular state and the walking stability in humanoid state could be significantly improved by controlling the slider motion.
Submitted 27 May, 2025;
originally announced May 2025.
-
LPCM: Learning-based Predictive Coding for LiDAR Point Cloud Compression
Authors:
Chang Sun,
Hui Yuan,
Shiqi Jiang,
Da Ai,
Wei Zhang,
Raouf Hamzaoui
Abstract:
Since the data volume of LiDAR point clouds is very large, efficient compression is necessary to reduce their storage and transmission costs. However, existing learning-based compression methods do not exploit the inherent angular resolution of LiDAR and ignore the significant differences in the correlation of geometry information at different bitrates. The predictive geometry coding method in the geometry-based point cloud compression (G-PCC) standard uses the inherent angular resolution to predict the azimuth angles. However, it only models a simple linear relationship between the azimuth angles of neighboring points. Moreover, it does not optimize the quantization parameters for residuals on each coordinate axis in the spherical coordinate system. We propose a learning-based predictive coding method (LPCM) with both high-bitrate and low-bitrate coding modes. LPCM converts point clouds into predictive trees using the spherical coordinate system. In high-bitrate coding mode, we use a lightweight Long Short-Term Memory-based predictive (LSTM-P) module that captures long-term geometry correlations between different coordinates to efficiently predict and compress the elevation angles. In low-bitrate coding mode, where geometry correlation degrades, we introduce a variational radius compression (VRC) module to directly compress the point radii. Then, we analyze why the quantization of spherical coordinates differs from that of Cartesian coordinates and propose a differential evolution (DE)-based quantization parameter selection method, which improves rate-distortion performance without increasing coding time. Experimental results on the LiDAR benchmark SemanticKITTI and the MPEG-specified Ford datasets show that LPCM outperforms G-PCC and other learning-based methods.
Submitted 26 May, 2025;
originally announced May 2025.
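The LPCM abstract above works in the spherical coordinate system (radius, azimuth, elevation). As a minimal sketch of that representation only, the conversion from a Cartesian LiDAR point might look like the following. The angle conventions used here (azimuth measured in the x-y plane, elevation measured from that plane) are an assumption, since the abstract does not state them, and `cartesian_to_spherical` is a hypothetical helper name.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Convert one Cartesian point to (radius, azimuth, elevation) in radians.

    Assumed convention (not given in the abstract): azimuth is the angle in
    the x-y plane from the +x axis, elevation is the angle above that plane.
    """
    radius = math.sqrt(x * x + y * y + z * z)
    if radius == 0.0:
        # The origin has no defined direction; return zeros by convention.
        return 0.0, 0.0, 0.0
    azimuth = math.atan2(y, x)
    # Clamp the ratio to guard against rounding slightly outside [-1, 1].
    elevation = math.asin(max(-1.0, min(1.0, z / radius)))
    return radius, azimuth, elevation
```

Under this convention a uniformly quantized azimuth or elevation step corresponds to a Cartesian displacement that grows with the radius, which gives some intuition for why quantizing spherical coordinates behaves differently from quantizing Cartesian ones.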
-
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Authors:
Bowen Zhang,
Congchao Guo,
Geng Yang,
Hang Yu,
Haozhe Zhang,
Heidi Lei,
Jialong Mai,
Junjie Yan,
Kaiyue Yang,
Mingqi Yang,
Peikai Huang,
Ruiyang Jin,
Sitan Jiang,
Weihua Cheng,
Yawei Li,
Yichen Xiao,
Yiying Zhou,
Yongmao Zhang,
Yuan Lu,
Yucen He
Abstract:
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from a text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
Submitted 12 May, 2025;
originally announced May 2025.
-
Zoomer: Adaptive Image Focus Optimization for Black-box MLLM
Authors:
Jiaxu Qian,
Chendong Wang,
Yifan Yang,
Chaoyun Zhang,
Huiqiang Jiang,
Xufang Luo,
Yu Kang,
Qingwei Lin,
Anlan Zhang,
Shiqi Jiang,
Ting Cao,
Tianjun Mao,
Suman Banerjee,
Guyue Liu,
Saravan Rajmohan,
Dongmei Zhang,
Yuqing Yang,
Qi Zhang,
Lili Qiu
Abstract:
Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omission of critical information, hampering performance. To address these limitations, we introduce Zoomer, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. Zoomer features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that Zoomer consistently outperforms baseline methods, achieving up to a 26.9% improvement in accuracy while significantly reducing token consumption.
Submitted 29 April, 2025;
originally announced May 2025.
-
Neural Stereo Video Compression with Hybrid Disparity Compensation
Authors:
Shiyin Jiang,
Zhenghao Chen,
Minghao Han,
Xingyu Zhou,
Leheng Zhang,
Shuhang Gu
Abstract:
Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one that employs explicit horizontal shifting, and another that utilizes an implicit cross-attention mechanism to reduce cross-view disparity redundancy. In this work, we propose a hybrid disparity compensation (HDC) strategy that leverages explicit pixel displacement as a robust prior feature to simplify optimization and perform implicit cross-attention mechanisms for subsequent warping operations, thereby capturing a broader range of disparity information. Specifically, HDC first computes a similarity map by fusing the horizontally shifted cross-view features to capture pixel displacement information. This similarity map is then normalized into an "explicit pixel-wise attention score" to perform the cross-attention mechanism, implicitly aligning features from one view to another. Building upon HDC, we introduce a novel end-to-end optimized neural stereo video compression framework, which integrates HDC-based modules into key coding operations, including cross-view feature extraction and reconstruction (HDC-FER) and cross-view entropy modeling (HDC-EM). Extensive experiments on SVC benchmarks, including KITTI 2012, KITTI 2015, and Nagoya, which cover both autonomous driving and general scenes, demonstrate that our framework outperforms both neural and traditional SVC methodologies.
Submitted 28 April, 2025;
originally announced April 2025.
-
Interleaved Block-based Learned Image Compression with Feature Enhancement and Quantization Error Compensation
Authors:
Shiqi Jiang,
Hui Yuan,
Shuai Li,
Raouf Hamzaoui,
Xu Wang,
Junyan Huo
Abstract:
In recent years, learned image compression (LIC) methods have achieved significant performance improvements. However, obtaining a more compact latent representation and reducing the impact of quantization errors remain key challenges in the field of LIC. To address these challenges, we propose a feature extraction module, a feature refinement module, and a feature enhancement module. Our feature extraction module shuffles the pixels in the image, splits the resulting image into sub-images, and extracts coarse features from the sub-images. Our feature refinement module stacks the coarse features and uses an attention refinement block composed of concatenated three-dimensional convolution residual blocks to learn more compact latent features by exploiting correlations across channels, within sub-images (intra-sub-image correlations), and across sub-images (inter-sub-image correlations). Our feature enhancement module reduces information loss in the decoded features following quantization. We also propose a quantization error compensation module that mitigates the quantization mismatch between training and testing. Our four modules can be readily integrated into state-of-the-art LIC methods. Experiments show that combining our modules with Tiny-LIC outperforms existing LIC methods and image compression standards in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) on the Kodak dataset and the CLIC dataset.
Submitted 20 February, 2025;
originally announced February 2025.
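The feature extraction step in the abstract above shuffles pixels and splits the image into sub-images. One common way to do this is interleaved (strided) sampling, sketched below in plain Python. Whether the paper uses exactly this pattern is an assumption suggested only by the word "interleaved" in the title; `split_interleaved` and its `stride` parameter are hypothetical names.

```python
def split_interleaved(image, stride=2):
    """Split a 2D image (list of rows) into stride*stride sub-images.

    Sub-image (dy, dx) keeps the pixels at rows dy, dy+stride, ... and
    columns dx, dx+stride, ..., so neighbouring pixels of the original
    image end up in different sub-images.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height % stride or width % stride:
        raise ValueError("image dimensions must be divisible by stride")
    return [
        [row[dx::stride] for row in image[dy::stride]]
        for dy in range(stride)
        for dx in range(stride)
    ]
```

Because each sub-image is a subsampled copy of the whole picture, the sub-images are strongly correlated with one another, which is the kind of inter-sub-image redundancy the refinement module is described as exploiting.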
-
FD-LSCIC: Frequency Decomposition-based Learned Screen Content Image Compression
Authors:
Shiqi Jiang,
Hui Yuan,
Shuai Li,
Huanqiang Zeng,
Sam Kwong
Abstract:
The learned image compression (LIC) methods have already surpassed traditional techniques in compressing natural scene (NS) images. However, directly applying these methods to screen content (SC) images, which possess distinct characteristics such as sharp edges, repetitive patterns, embedded text and graphics, yields suboptimal results. This paper addresses three key challenges in SC image compression: learning compact latent features, adapting quantization step sizes, and the lack of large SC datasets. To overcome these challenges, we propose a novel compression method that employs a multi-frequency two-stage octave residual block (MToRB) for feature extraction, a cascaded triple-scale feature fusion residual block (CTSFRB) for multi-scale feature integration and a multi-frequency context interaction module (MFCIM) to reduce inter-frequency correlations. Additionally, we introduce an adaptive quantization module that learns scaled uniform noise for each frequency component, enabling flexible control over quantization granularity. Furthermore, we construct a large SC image compression dataset (SDU-SCICD10K), which includes over 10,000 images spanning basic SC images, computer-rendered images, and mixed NS and SC images from both PC and mobile platforms. Experimental results demonstrate that our approach significantly improves SC image compression performance, outperforming traditional standards and state-of-the-art learning-based methods in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM).
Submitted 20 February, 2025;
originally announced February 2025.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors:
Ailin Huang,
Boyong Wu,
Bruce Wang,
Chao Yan,
Chen Hu,
Chengli Feng,
Fei Tian,
Feiyu Shen,
Jingbei Li,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Gong,
Zixin Zhang,
Hongyu Zhou,
Jianjian Sun,
Brian Li,
Chengting Feng,
Changyi Wan,
Hanpeng Hu,
et al. (120 additional authors not shown)
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Mode Switching-Induced Instability of Multi-source Feed DC Microgrid
Authors:
Shanshan Jiang,
Zelin Sun,
Jiankun Zhang,
Hua Geng
Abstract:
In DC microgrids (DCMGs), the DC-bus signaling-based control strategy is extensively used for power management, where mode switching plays a crucial role in achieving multi-source coordination. However, few studies have noticed the impact of mode switching and switching strategies on system voltage stability. To fill this gap, this paper aims to provide a general analysis framework for mode switching-induced instability in multi-source DCMGs. First, manifold theory is employed to analyze the stability of the DCMG switched system. Subsequently, the instability mechanism and its physical interpretation are explored. The positive feedback activated by the decreasing DC bus voltage during the switching process leads to instability, and the switching strategy may inadvertently contribute to it. To improve stability, a novel control method based on mode scheduling is proposed, which adjusts the switching strategy and thereby corrects the system trajectory. Finally, both real-time simulations and experimental tests on a DCMG system verify the correctness and effectiveness of the theoretical analysis.
Submitted 10 April, 2025; v1 submitted 27 January, 2025;
originally announced January 2025.
-
ProjectedEx: Enhancing Generation in Explainable AI for Prostate Cancer
Authors:
Xuyin Qi,
Zeyu Zhang,
Aaron Berliano Handoko,
Huazhan Zheng,
Mingxi Chen,
Ta Duc Huy,
Vu Minh Hieu Phan,
Lei Zhang,
Linqi Cheng,
Shiyu Jiang,
Zhiwei Zhang,
Zhibin Liao,
Yang Zhao,
Minh-Son To
Abstract:
Prostate cancer, a growing global health concern, necessitates precise diagnostic tools, with Magnetic Resonance Imaging (MRI) offering high-resolution soft tissue imaging that significantly enhances diagnostic accuracy. Recent advancements in explainable AI and representation learning have significantly improved prostate cancer diagnosis by enabling automated and precise lesion classification. However, existing explainable AI methods, particularly those based on frameworks like generative adversarial networks (GANs), are predominantly developed for natural image generation, and their application to medical imaging often leads to suboptimal performance due to the unique characteristics and complexity of medical images. To address these challenges, our paper introduces three key contributions. First, we propose ProjectedEx, a generative framework that provides interpretable, multi-attribute explanations, effectively linking medical image features to classifier decisions. Second, we enhance the encoder module by incorporating feature pyramids, which enables multiscale feedback to refine the latent space and improves the quality of generated explanations. Third, we conduct comprehensive experiments on both the generator and classifier, demonstrating the clinical relevance and effectiveness of ProjectedEx in enhancing interpretability and supporting the adoption of AI in medical settings. Code will be released at https://github.com/Richardqiyi/ProjectedEx
Submitted 2 January, 2025;
originally announced January 2025.
-
LINKs: Large Language Model Integrated Management for 6G Empowered Digital Twin NetworKs
Authors:
Shufan Jiang,
Bangyan Lin,
Yue Wu,
Yuan Gao
Abstract:
In the rapidly evolving landscape of digital twins (DT) and 6G networks, the integration of large language models (LLMs) presents a novel approach to network management. This paper explores the application of LLMs in managing 6G-empowered DT networks, with a focus on optimizing data retrieval and communication efficiency in smart city scenarios. The proposed framework leverages LLMs for intelligent DT problem analysis and radio resource management (RRM) in a fully autonomous way without any manual intervention. Our proposed framework, LINKs, builds a lazy loading strategy that minimizes transmission delay by selectively retrieving the relevant data. Based on the data retrieval plan, LLMs transform the retrieval task into a numerical optimization problem and use solvers to build an optimal RRM, ensuring efficient communication across the network. Simulation results demonstrate the performance improvements in data planning and network management, highlighting the potential of LLMs to enhance the integration of DT and 6G technologies.
Submitted 9 December, 2024;
originally announced December 2024.
-
XLSTM-HVED: Cross-Modal Brain Tumor Segmentation and MRI Reconstruction Method Using Vision XLSTM and Heteromodal Variational Encoder-Decoder
Authors:
Shenghao Zhu,
Yifei Chen,
Shuo Jiang,
Weihong Chen,
Chang Liu,
Yuanhan Wang,
Xu Chen,
Yifan Ke,
Feiwei Qin,
Changmiao Wang,
Zhu Zhu
Abstract:
Neurogliomas are among the most aggressive forms of cancer, presenting considerable challenges in both treatment and monitoring due to their unpredictable biological behavior. Magnetic resonance imaging (MRI) is currently the preferred method for diagnosing and monitoring gliomas. However, the lack of specific imaging techniques often compromises the accuracy of tumor segmentation during the imaging process. To address this issue, we introduce the XLSTM-HVED model. This model integrates a hetero-modal encoder-decoder framework with the Vision XLSTM module to reconstruct missing MRI modalities. By deeply fusing spatial and temporal features, it enhances tumor segmentation performance. The key innovation of our approach is the Self-Attention Variational Encoder (SAVE) module, which improves the integration of modal features. Additionally, it optimizes the interaction of features between segmentation and reconstruction tasks through the Squeeze-Fusion-Excitation Cross Awareness (SFECA) module. Our experiments using the BraTS 2024 dataset demonstrate that our model significantly outperforms existing advanced methods in handling cases where modalities are missing. Our source code is available at https://github.com/Quanato607/XLSTM-HVED.
Submitted 5 March, 2025; v1 submitted 9 December, 2024;
originally announced December 2024.
-
Digital Twin Assisted Beamforming Design for Integrated Sensing and Communication Systems
Authors:
Shuaifeng Jiang,
Ahmed Alkhateeb
Abstract:
This paper explores a novel research direction where a digital twin is leveraged to assist the beamforming design for an integrated sensing and communication (ISAC) system. In this setup, a base station designs joint communication and sensing beamforming to serve the communication user and detect the sensing target concurrently. Utilizing the electromagnetic (EM) 3D model of the environment and ray tracing, the digital twin can provide various information, e.g., propagation path parameters and wireless channels, to aid communication and sensing systems. More specifically, our digital twin-based beamforming design first leverages the environment EM 3D model and ray tracing to (i) predict the directions of the line-of-sight (LoS) and non-line-of-sight (NLoS) sensing channel paths and (ii) identify the dominant one among these sensing channel paths. Then, to optimize the joint sensing and communication beam, we maximize the sensing signal-to-noise ratio (SNR) on the dominant sensing channel component while satisfying a minimum communication signal-to-interference-plus-noise ratio (SINR) requirement. Simulation results show that the proposed digital twin-assisted beamforming design achieves near-optimal target sensing SNR in both LoS and NLoS dominant areas, while ensuring the required SINR for the communication user. This highlights the potential of leveraging digital twins to assist ISAC systems.
Submitted 9 December, 2024;
originally announced December 2024.
-
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
Authors:
Aohan Zeng,
Zhengxiao Du,
Mingdao Liu,
Kedong Wang,
Shengmin Jiang,
Lei Zhao,
Yuxiao Dong,
Jie Tang
Abstract:
We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://github.com/THUDM/GLM-4-Voice and https://huggingface.co/THUDM/glm-4-voice-9b.
Submitted 3 December, 2024;
originally announced December 2024.
-
Scaling Speech-Text Pre-training with Synthetic Interleaved Data
Authors:
Aohan Zeng,
Zhengxiao Du,
Mingdao Liu,
Lei Zhang,
Shengmin Jiang,
Yuxiao Dong,
Jie Tang
Abstract:
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their ability to scale as LLMs do. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach results in discrete speech tokens with strong semantic preservation even at lower frame rates (e.g., 12.5 Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken question tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves performance comparable to existing baselines in both conversational abilities and speech quality, even when operating exclusively in the speech domain.
Submitted 2 December, 2024; v1 submitted 26 November, 2024;
originally announced November 2024.
-
Cross-Layer Encrypted Semantic Communication Framework for Panoramic Video Transmission
Authors:
Haixiao Gao,
Mengying Sun,
Xiaodong Xu,
Bingxuan Xu,
Shujun Han,
Bizhu Wang,
Sheng Jiang,
Chen Dong,
Ping Zhang
Abstract:
In this paper, we propose a cross-layer encrypted semantic communication (CLESC) framework for panoramic video transmission, incorporating feature extraction, encoding, encryption, cyclic redundancy check (CRC), and retransmission processes to achieve compatibility between semantic communication and traditional communication systems. Additionally, we propose an adaptive cross-layer transmission mechanism that dynamically adjusts CRC, channel coding, and retransmission schemes based on the importance of semantic information. This ensures that important information is prioritized under poor transmission conditions. To verify the aforementioned framework, we also design an end-to-end adaptive panoramic video semantic transmission (APVST) network that leverages a deep joint source-channel coding (Deep JSCC) structure and attention mechanism, integrated with a latitude adaptive module that facilitates adaptive semantic feature extraction and variable-length encoding of panoramic videos. The proposed CLESC is also applicable to the transmission of other modal data. Simulation results demonstrate that the proposed CLESC effectively achieves compatibility and adaptation between semantic communication and traditional communication systems, improving both transmission efficiency and channel adaptability. Compared to traditional cross-layer transmission schemes, the CLESC framework can reduce bandwidth consumption by 85% while showing significant advantages under low signal-to-noise ratio (SNR) conditions.
Submitted 19 November, 2024;
originally announced November 2024.
-
Transfer Learning in Vocal Education: Technical Evaluation of Limited Samples Describing Mezzo-soprano
Authors:
Zhenyi Hou,
Xu Zhao,
Kejie Ye,
Xinyu Sheng,
Shanggerile Jiang,
Jiajing Xia,
Yitao Zhang,
Chenxi Ban,
Daijun Luo,
Jiaxing Chen,
Yan Zou,
Yuchao Feng,
Guangyu Fan,
Xin Yuan
Abstract:
Vocal education in the music field is difficult to quantify due to the individual differences in singers' voices and the different quantitative criteria of singing techniques. Deep learning has great potential to be applied in music education due to its efficiency in handling complex data and performing quantitative analysis. However, accurate evaluation of rare vocal types, such as Mezzo-soprano, for which samples are limited, requires extensive well-annotated data when using deep learning models. To attain this objective, we perform transfer learning by employing deep learning models pre-trained on the ImageNet and Urbansound8k datasets to improve the precision of vocal technique evaluation. Furthermore, we tackle the lack of samples by constructing a dedicated dataset, the Mezzo-soprano Vocal Set (MVS), for vocal technique assessment. Our experimental results indicate that transfer learning increases the overall accuracy (OAcc) of all models by an average of 8.3%, with the highest accuracy at 94.2%. We not only provide a novel approach to evaluating Mezzo-soprano vocal techniques but also introduce a new quantitative assessment method for music education.
Submitted 30 October, 2024;
originally announced October 2024.
-
Resilience-Oriented DG Siting and Sizing Considering Energy Equity Constraint
Authors:
Chenchen Li,
Fangxing Li,
Sufan Jiang,
Jin Zhao,
Shiyuan Fan,
Leon M. Tolbert
Abstract:
Extreme weather events can cause widespread power outages and huge economic losses. Low-income customers are more vulnerable to power outages because they live in areas with poorly equipped distribution systems. However, existing approaches to improve grid resilience focus on the overall condition of the system and ignore the outage experiences of low-income customers, which leads to significant energy inequities in resilience. Therefore, this paper explores a new resilience-oriented planning method for distributed generator (DG) siting and sizing, by embedding an additional energy equity constraint (EEC). First, the expected load shedding index (ELSI) is defined as the ratio of the load shedding to the original load, which quantifies the resilience-oriented energy equity. Then, the DG siting and sizing problem is formulated as a two-stage stochastic programming with the EEC. The first stage determines the optimal sites and sizes of DG units under investment constraints and EECs, while the second stage optimizes expected costs of unserved load. A subsidiary variable is introduced to ensure the model's solvability. Finally, numerical studies are performed on the IEEE 33-bus and 123-bus systems to verify the effectiveness of the proposed DG planning model in achieving energy equity. Three observations are presented as future guidelines for resilience-oriented DG planning.
Submitted 17 October, 2024;
originally announced October 2024.
-
Causal Context Adjustment Loss for Learned Image Compression
Authors:
Minghao Han,
Shiyin Jiang,
Shengxi Li,
Xin Deng,
Mai Xu,
Ce Zhu,
Shuhang Gu
Abstract:
In recent years, learned image compression (LIC) technologies have notably surpassed conventional methods in terms of rate-distortion (RD) performance. Most present learned techniques are VAE-based with an autoregressive entropy model, which improves RD performance by utilizing the decoded causal context. However, extant methods are highly dependent on a fixed hand-crafted causal context. The question of how to guide the auto-encoder to generate a causal context that is more beneficial for the autoregressive entropy model is worth exploring. In this paper, we make the first attempt to investigate how to explicitly adjust the causal context with our proposed Causal Context Adjustment loss (CCA-loss). By imposing the CCA-loss, we enable the neural network to spontaneously move important information into the early stage of the autoregressive entropy model. Furthermore, as transformer technology has developed remarkably, its variants have been adopted by many state-of-the-art (SOTA) LIC techniques. However, existing computing devices are not well adapted to the computation of the attention mechanism, which imposes a burden on computation and inference latency. To overcome this, we establish a convolutional neural network (CNN) image compression model and adopt an unevenly channel-wise grouped strategy for high efficiency. Ultimately, the proposed CNN-based LIC network trained with our Causal Context Adjustment loss attains a great trade-off between inference latency and rate-distortion performance.
Submitted 7 October, 2024;
originally announced October 2024.
-
SPAC: Sampling-based Progressive Attribute Compression for Dense Point Clouds
Authors:
Xiaolong Mao,
Hui Yuan,
Tian Guo,
Shiqi Jiang,
Raouf Hamzaoui,
Sam Kwong
Abstract:
We propose an end-to-end attribute compression method for dense point clouds. The proposed method combines a frequency sampling module, an adaptive scale feature extraction module with geometry assistance, and a global hyperprior entropy model. The frequency sampling module uses a Hamming window and the Fast Fourier Transform to extract high-frequency components of the point cloud. The difference between the original point cloud and the sampled point cloud is divided into multiple sub-point clouds. These sub-point clouds are then partitioned using an octree, providing a structured input for feature extraction. The feature extraction module integrates adaptive convolutional layers and uses offset-attention to capture both local and global features. Then, a geometry-assisted attribute feature refinement module is used to refine the extracted attribute features. Finally, a global hyperprior model is introduced for entropy encoding. This model propagates hyperprior parameters from the deepest (base) layer to the other layers, further enhancing the encoding efficiency. At the decoder, a mirrored network is used to progressively restore features and reconstruct the color attribute through transposed convolutional layers. The proposed method encodes base layer information at a low bitrate and progressively adds enhancement layer information to improve reconstruction accuracy. Compared to the latest G-PCC test model (TMC13v23) under the MPEG common test conditions (CTCs), the proposed method achieved an average Bjontegaard delta bitrate reduction of 24.58% for the Y component (21.23% for YUV combined) on the MPEG Category Solid dataset and 22.48% for the Y component (17.19% for YUV combined) on the MPEG Category Dense dataset. This is the first instance of a learning-based codec outperforming the G-PCC standard on these datasets under the MPEG CTCs.
Submitted 16 September, 2024;
originally announced September 2024.
-
Learnable Wireless Digital Twins: Reconstructing Electromagnetic Field with Neural Representations
Authors:
Shuaifeng Jiang,
Qi Qu,
Xiaqing Pan,
Abhishek Agrawal,
Richard Newcombe,
Ahmed Alkhateeb
Abstract:
Fully harvesting the gain of multiple-input and multiple-output (MIMO) requires accurate channel information. However, conventional channel acquisition methods mainly rely on pilot training signals, resulting in significant training overheads (time, energy, spectrum). Digital twin-aided communications have been proposed in [1] to reduce or eliminate this overhead by approximating the real world with a digital replica. However, implementing a digital twin-aided communication system brings new challenges, in particular how to model the 3D environment and the associated EM properties and how to update the environment dynamics in a coherent manner. To address these challenges, motivated by the latest advancements in computer vision, 3D reconstruction, and neural radiance fields, we propose an end-to-end deep learning framework for future generation wireless systems that can reconstruct the 3D EM field covered by a wireless access point, based on widely available crowd-sourced world-locked wireless samples between the access point and the devices. This visionary framework is grounded in classical EM theory and employs deep learning models to learn the EM properties and interaction behaviors of the objects in the environment. Simulation results demonstrate that the proposed learnable digital twin can implicitly learn the EM properties of the objects, accurately predict wireless channels, and generalize to changes in the environment, highlighting the prospect of this novel direction for future generation wireless platforms.
Submitted 25 September, 2024; v1 submitted 4 September, 2024;
originally announced September 2024.
-
Generative AI on SpectrumNet: An Open Benchmark of Multiband 3D Radio Maps
Authors:
Shuhang Zhang,
Shuai Jiang,
Wanjie Lin,
Zheng Fang,
Kangjun Liu,
Hongliang Zhang,
Ke Chen
Abstract:
A radio map is an efficient means of visually displaying the wireless signal coverage within a certain region. It has been considered increasingly helpful for the future sixth generation (6G) of wireless networks, as wireless nodes are becoming more crowded and complicated. However, the construction of high-resolution radio maps is very challenging due to the sparse sampling in practical systems. Generative artificial intelligence (AI), which is capable of creating synthetic data to fill in gaps in real-world measurements, is an effective technique to construct high-precision radio maps. Currently, generative models for radio map construction are trained with two-dimensional (2D) single-band radio maps in urban scenarios, which leads to poor generalization across diverse terrain scenarios, spectrum bands, and heights. To tackle this problem, we provide a multiband three-dimensional (3D) radio map dataset with consideration of terrain and climate information, named SpectrumNet. It is the largest radio map dataset in terms of dimensions and scale, containing radio maps of 3 spatial dimensions, 5 frequency bands, 11 terrain scenarios, and 3 climate scenarios. We introduce the parameters and settings for the SpectrumNet dataset generation, and evaluate three baseline methods for radio map construction based on the SpectrumNet dataset. Experiments show the necessity of the SpectrumNet dataset for training models with strong generalization in the spatial, frequency, and scenario domains. Future work on the SpectrumNet dataset is also discussed, including dataset expansion and calibration, as well as extended studies on generative models for radio map construction based on the SpectrumNet dataset.
Submitted 9 August, 2024;
originally announced August 2024.
-
Relax, Estimate, and Track: a Simple Battery State-of-charge and State-of-health Estimation Method
Authors:
Shida Jiang,
Junzhe Shi,
Scott Moura
Abstract:
Battery management is a critical component of ubiquitous battery-powered energy systems, in which battery state-of-charge (SOC) and state-of-health (SOH) estimations are of crucial importance. Conventional SOC and SOH estimation methods, especially model-based methods, often lack accurate modeling of the open circuit voltage (OCV), have relatively high computational complexity, and lack theoretical analysis. This study introduces a simple SOC and SOH estimation method that overcomes all these weaknesses. The key idea of the proposed method is to momentarily set the cell's current to zero for a few minutes during the charging, perform SOC and SOH estimation based on the measured data, and continue tracking the cell's SOC afterward. The method is based on rigorous theoretical analysis, requires no hyperparameter fine-tuning, and is hundreds of times faster than conventional model-based methods. The method is validated on six batteries charged at different C rates and temperatures, realizing fast and accurate estimations under various conditions, with a SOH root mean square error (RMSE) of around 3% and a SOC RMSE of around 1.5%. The data and codes are available at https://berkeley.box.com/s/jz1w6po2iqzzfy7irxd9ok47ku3tr86j.
Submitted 6 June, 2025; v1 submitted 2 August, 2024;
originally announced August 2024.
-
Babel: A Scalable Pre-trained Model for Multi-Modal Sensing via Expandable Modality Alignment
Authors:
Shenghong Dai,
Shiqi Jiang,
Yifan Yang,
Ting Cao,
Mo Li,
Suman Banerjee,
Lili Qiu
Abstract:
This paper presents Babel, the expandable modality alignment model, specially designed for multi-modal sensing. While there has been considerable work on multi-modality alignment, existing methods all struggle to effectively incorporate multiple sensing modalities due to data scarcity constraints. How to utilize multi-modal data with partial pairings in sensing remains an unresolved challenge. Babel tackles this challenge by introducing the concept of expandable modality alignment. The key idea involves transforming the N-modality alignment into a series of binary-modality alignments. Novel techniques are also proposed to further mitigate the data scarcity issue and balance the contribution of the newly incorporated modality with the previously established modality alignment during the expandable alignment process. We provide a comprehensive implementation. In the pre-training phase, Babel currently aligns 6 sensing modalities, namely Wi-Fi, mmWave, IMU, LiDAR, video, and depth. For the deployment phase, as a foundation model, any single aligned modality or combination of aligned modalities can be selected from Babel and applied to downstream tasks. Evaluation demonstrates Babel's outstanding performance on eight human activity recognition datasets, compared to a broad range of baselines, e.g., SOTA single-modal sensing networks, multi-modal sensing frameworks, and multi-modal large language models. Babel not only improves the performance of individual modality sensing (12% averaged accuracy improvement), but also effectively fuses multiple available modalities (up to 22% accuracy increase). Case studies also highlight emerging application scenarios empowered by Babel, including cross-modality retrieval (i.e., sensing imaging), and bridging LLM for sensing comprehension.
Submitted 21 March, 2025; v1 submitted 25 July, 2024;
originally announced July 2024.
-
OMR-NET: a two-stage octave multi-scale residual network for screen content image compression
Authors:
Shiqi Jiang,
Ting Ren,
Congrui Fu,
Shuai Li,
Hui Yuan
Abstract:
Screen content (SC) differs from natural scene (NS) content in unique characteristics such as the absence of noise, repetitive patterns, and high contrast. To address the inadequacies of current learned image compression (LIC) methods for SC, we propose improved two-stage octave convolutional residual blocks (IToRB) for high and low-frequency feature extraction and cascaded two-stage multi-scale residual blocks (CTMSRB) for improved multi-scale learning and nonlinearity in SC. Additionally, we employ a window-based attention module (WAM) to capture pixel correlations, especially for high-contrast regions in the image. We also construct a diverse SC image compression dataset (SDU-SCICD2K) for training, including text, charts, graphics, animation, movie, game, and mixtures of SC images and NS images. Experimental results show that our method, which is more suited for SC than NS data, outperforms existing LIC methods in rate-distortion performance on SC images. The code is publicly available at https://github.com/SunshineSki/OMR Net.git.
Submitted 11 July, 2024;
originally announced July 2024.
-
Global Spatial-Temporal Information-based Residual ConvLSTM for Video Space-Time Super-Resolution
Authors:
Congrui Fu,
Hui Yuan,
Shiqi Jiang,
Guanghui Zhang,
Liquan Shen,
Raouf Hamzaoui
Abstract:
By converting low-frame-rate, low-resolution videos into high-frame-rate, high-resolution ones, space-time video super-resolution techniques can enhance visual experiences and facilitate more efficient information dissemination. We propose a convolutional neural network (CNN) for space-time video super-resolution, namely GIRNet. To generate highly accurate features and thus improve performance, the proposed network integrates a feature-level temporal interpolation module with deformable convolutions and a global spatial-temporal information-based residual convolutional long short-term memory (convLSTM) module. In the feature-level temporal interpolation module, we leverage deformable convolution, which adapts to deformations and scale variations of objects across different scene locations. This presents a more efficient solution than conventional convolution for extracting features from moving objects. Our network effectively uses forward and backward feature information to determine inter-frame offsets, leading to the direct generation of interpolated frame features. In the global spatial-temporal information-based residual convLSTM module, the first convLSTM is used to derive global spatial-temporal information from the input features, and the second convLSTM uses the previously computed global spatial-temporal information feature as its initial cell state. This second convLSTM adopts residual connections to preserve spatial information, thereby enhancing the output features. Experiments on the Vimeo90K dataset show that the proposed method outperforms state-of-the-art techniques in peak signal-to-noise ratio (by 1.45 dB, 1.14 dB, and 0.02 dB over STARnet, TMNet, and 3DAttGAN, respectively), structural similarity index (by 0.027, 0.023, and 0.006 over STARnet, TMNet, and 3DAttGAN, respectively), and visual quality.
Submitted 11 July, 2024;
originally announced July 2024.
-
A New Framework for Nonlinear Kalman Filters
Authors:
Shida Jiang,
Junzhe Shi,
Scott Moura
Abstract:
The Kalman filter (KF) is a state estimation algorithm that optimally combines system knowledge and measurements to minimize the mean squared error of the estimated states. While KF was initially designed for linear systems, numerous extensions of it, such as extended Kalman filter (EKF), unscented Kalman filter (UKF), cubature Kalman filter (CKF), etc., have been proposed for nonlinear systems over the last sixty years. Although different types of nonlinear KFs have different pros and cons, they all use the same framework of linear KF. Yet, according to our theoretical and empirical analysis, the framework tends to give overconfident and less accurate state estimations when the measurement functions are nonlinear. Therefore, in this study, we designed a new framework that can be combined with any existing type of nonlinear KFs and showed theoretically and empirically that the new framework estimates the states and covariance more accurately than the old one. The new framework was tested on four different nonlinear KFs and five different tasks, showcasing its ability to reduce estimation errors by several orders of magnitude in low-measurement-noise conditions. The codes are available at https://github.com/Shida-Jiang/A-new-framework-for-nonlinear-Kalman-filters
△ Less
Submitted 19 June, 2025; v1 submitted 8 July, 2024;
originally announced July 2024.
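For orientation, the conventional linear KF framework that the abstract above refers to can be sketched for a scalar state as follows. This is a minimal illustration of the standard predict/update cycle only, not the new framework proposed in the paper; the function names and the parameters `a`, `h`, `q`, and `r` (state transition, measurement gain, process noise variance, measurement noise variance) are assumptions chosen for the example.

```python
def kalman_step(x, p, z, a=1.0, h=1.0, q=1e-4, r=0.1):
    """One predict/update cycle of a scalar linear Kalman filter.

    x, p : previous state estimate and its variance
    z    : new measurement
    a, h : state-transition and measurement coefficients (illustrative)
    q, r : process and measurement noise variances (illustrative)
    """
    # Predict: propagate the estimate and its variance through the model.
    x_pred = a * x
    p_pred = a * p * a + q

    # Update: blend prediction and measurement using the Kalman gain.
    k = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + k * (z - h * x_pred)
    p_new = (1.0 - k * h) * p_pred
    return x_new, p_new


def run_filter(measurements, x0=0.0, p0=1.0, **kwargs):
    """Run the scalar filter over a sequence of measurements."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        x, p = kalman_step(x, p, z, **kwargs)
        estimates.append(x)
    return estimates, p
```

Calling `run_filter` on a list of noisy readings of a constant quantity returns the running estimates and the final variance; the variance shrinks as measurements accumulate, which is the quantity the abstract says can become overconfident when the measurement function is nonlinear.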
-
Multimodal Cross-Task Interaction for Survival Analysis in Whole Slide Pathological Images
Authors:
Songhan Jiang,
Zhengyu Gan,
Linghan Cai,
Yifeng Wang,
Yongbing Zhang
Abstract:
Survival prediction, utilizing pathological images and genomic profiles, is increasingly important in cancer analysis and prognosis. Despite significant progress, precise survival analysis still faces two main challenges: (1) The massive number of pixels contained in whole slide images (WSIs) complicates the processing of pathological images, making it difficult to generate an effective representation of the tumor microenvironment (TME). (2) Existing multimodal methods often rely on alignment strategies to integrate complementary information, which may lead to information loss due to the inherent heterogeneity between pathology and genes. In this paper, we propose a Multimodal Cross-Task Interaction (MCTI) framework to explore the intrinsic correlations between subtype classification and survival analysis tasks. Specifically, to capture TME-related features in WSIs, we leverage the subtype classification task to mine tumor regions. Simultaneously, multi-head attention mechanisms are applied in genomic feature extraction, adaptively performing gene grouping to obtain task-related genomic embeddings. With the joint representation of pathological images and genomic data, we further introduce a Transport-Guided Attention (TGA) module that uses optimal transport theory to model the correlation between subtype classification and survival analysis tasks, effectively transferring potential information. Extensive experiments demonstrate the superiority of our approach, with MCTI outperforming state-of-the-art frameworks on three public benchmarks. https://github.com/jsh0792/MCTI
Submitted 24 June, 2024;
originally announced June 2024.
-
PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding
Authors:
Trang Le,
Daniel Lazar,
Suyoun Kim,
Shan Jiang,
Duc Le,
Adithya Sagar,
Aleksandr Livshits,
Ahmed Aly,
Akshat Shrivastava
Abstract:
Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation, however these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions of autoregressive deliberation systems. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide analysis on the necessity of each component of the system.
Submitted 11 June, 2024;
originally announced June 2024.
-
Deep RAW Image Super-Resolution. A NTIRE 2024 Challenge Survey
Authors:
Marcos V. Conde,
Florin-Alexandru Vasluianu,
Radu Timofte,
Jianxing Zhang,
Jia Li,
Fan Wang,
Xiaopeng Li,
Zikun Liu,
Hyunhee Park,
Sejun Song,
Changho Kim,
Zhijuan Huang,
Hongyuan Yu,
Cheng Wan,
Wending Xiang,
Jiamin Lin,
Hang Zhong,
Qiaosong Zhang,
Yue Sun,
Xuanwu Yin,
Kunlong Zuo,
Senyan Xu,
Siyuan Jiang,
Zhijing Sun,
Jiaying Zhu
, et al. (10 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is not as explored as in the RGB domain. The goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during the challenge period. The performance of the top-5 submissions is reviewed and provided here as a gauge for the current state-of-the-art in RAW Image Super-Resolution.
Submitted 24 April, 2024;
originally announced April 2024.
-
H2ASeg: Hierarchical Adaptive Interaction and Weighting Network for Tumor Segmentation in PET/CT Images
Authors:
Jinpeng Lu,
Jingyun Chen,
Linghan Cai,
Songhan Jiang,
Yongbing Zhang
Abstract:
Positron emission tomography (PET) combined with computed tomography (CT) imaging is routinely used in cancer diagnosis and prognosis by providing complementary information. Automatically segmenting tumors in PET/CT images can significantly improve examination efficiency. Traditional multi-modal segmentation solutions mainly rely on concatenation operations for modality fusion, which fail to effectively model the non-linear dependencies between PET and CT modalities. Recent studies have investigated various approaches to optimize the fusion of modality-specific features for enhancing joint representations. However, modality-specific encoders used in these methods operate independently, inadequately leveraging the synergistic relationships inherent in PET and CT modalities, for example, the complementarity between semantics and structure. To address these issues, we propose a Hierarchical Adaptive Interaction and Weighting Network termed H2ASeg to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we design a Modality-Cooperative Spatial Attention (MCSA) module that performs intra- and inter-modal interactions globally and locally. Additionally, a Target-Aware Modality Weighting (TAMW) module is developed to highlight tumor-related features within multi-modal features, thereby refining tumor segmentation. By embedding these modules across different layers, H2ASeg can hierarchically model cross-modal correlations, enabling a nuanced understanding of both semantic and structural tumor features. Extensive experiments demonstrate the superiority of H2ASeg, outperforming state-of-the-art methods on AutoPet-II and Hecktor2022 benchmarks. The code is released at https://github.com/JinPLu/H2ASeg.
Submitted 28 March, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
Dynamic Perturbation-Adaptive Adversarial Training on Medical Image Classification
Authors:
Shuai Li,
Xiaoguang Ma,
Shancheng Jiang,
Lu Meng
Abstract:
Remarkable successes were made in Medical Image Classification (MIC) recently, mainly due to wide applications of convolutional neural networks (CNNs). However, adversarial examples (AEs) exhibited imperceptible similarity with raw data, raising serious concerns about network robustness. Although adversarial training (AT), in responding to malevolent AEs, was recognized as an effective approach to improve robustness, it was challenging to overcome the generalization decline of networks caused by AT. In this paper, in order to preserve high generalization while improving robustness, we proposed a dynamic perturbation-adaptive adversarial training (DPAAT) method, which placed AT in a dynamic learning environment to generate adaptive data-level perturbations and provided a dynamically updated criterion by loss information collections to handle the disadvantage of fixed perturbation sizes in conventional AT methods and the dependence on external transference. Comprehensive testing on the dermatology HAM10000 dataset showed that DPAAT not only achieved better robustness improvement and generalization preservation but also significantly enhanced mean average precision and interpretability on various CNNs, indicating its great potential as a generic adversarial training method for MIC.
Submitted 11 March, 2024;
originally announced March 2024.
-
Digital Twin Aided Massive MIMO: CSI Compression and Feedback
Authors:
Shuaifeng Jiang,
Ahmed Alkhateeb
Abstract:
Deep learning (DL) approaches have demonstrated high performance in compressing and reconstructing the channel state information (CSI) and reducing the CSI feedback overhead in massive MIMO systems. One key challenge, however, with the DL approaches is the demand for extensive training data. Collecting this real-world CSI data incurs significant overhead that hinders the DL approaches from scaling to a large number of communication sites. To address this challenge, we propose a novel direction that utilizes site-specific digital twins to aid the training of DL models. The proposed digital twin approach generates site-specific synthetic CSI data from the EM 3D model and ray tracing, which can then be used to train the DL model without real-world data collection. To further improve the performance, we adopt online data selection to refine the DL model training with a small real-world CSI dataset. Results show that a DL model trained solely on the digital twin data can achieve high performance when tested in a real-world deployment. Further, leveraging domain adaptation techniques, the proposed approach requires orders of magnitude less real-world data to approach the performance of the model trained completely on a real-world CSI dataset.
Submitted 29 February, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
-
Automatic laminectomy cutting plane planning based on artificial intelligence in robot assisted laminectomy surgery
Authors:
Zhuofu Li,
Yonghong Zhang,
Chengxia Wang,
Shanshan Liu,
Xiongkang Song,
Xuquan Ji,
Shuai Jiang,
Woquan Zhong,
Lei Hu,
Weishi Li
Abstract:
Objective: This study aims to use artificial intelligence to realize the automatic planning of laminectomy, and verify the method. Methods: We propose a two-stage approach for automatic laminectomy cutting plane planning. The first stage was the identification of key points. 7 key points were manually marked on each CT image. The Spatial Pyramid Upsampling Network (SPU-Net) algorithm developed by us was used to accurately locate the 7 key points. In the second stage, based on the identification of key points, a personalized coordinate system was generated for each vertebra. Finally, the transverse and longitudinal cutting planes of laminectomy were generated under the coordinate system. The overall effect of planning was evaluated. Results: In the first stage, the average localization error of the SPU-Net algorithm for the seven key points was 0.65mm. In the second stage, a total of 320 transverse cutting planes and 640 longitudinal cutting planes were planned by the algorithm. Among them, the number of horizontal plane planning effects of grade A, B, and C were 318(99.38%), 1(0.31%), and 1(0.31%), respectively. The longitudinal planning effects of grade A, B, and C were 622(97.18%), 1(0.16%), and 17(2.66%), respectively. Conclusions: In this study, we propose a method for automatic surgical path planning of laminectomy based on the localization of key points in CT images. The results showed that the method achieved satisfactory results. More studies are needed to confirm the reliability of this approach in the future.
Submitted 25 December, 2023;
originally announced December 2023.
-
EM Based p-norm-like Constraint RLS Algorithm for Sparse System Identification
Authors:
Shuyang Jiang,
Kung Yao
Abstract:
In this paper, the recursive least squares (RLS) algorithm is considered in the sparse system identification setting. The cost function of RLS algorithm is regularized by a $p$-norm-like ($0 \leq p \leq 1$) constraint of the estimated system parameters. In order to minimize the regularized cost function, we transform it into a penalized maximum likelihood (ML) problem, which is solved by the expectation-maximization (EM) algorithm. With the introduction of a thresholding operator, the update equation of the tap-weight vector is derived. We also exploit the underlying sparsity to implement the proposed algorithm in a low computational complexity fashion. Numerical simulations demonstrate the superiority of the new algorithm over conventional sparse RLS algorithms, as well as regular RLS algorithm.
Submitted 10 December, 2023;
originally announced December 2023.
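As background for the abstract above, the regular (unregularized) RLS recursion that the paper uses as a baseline can be sketched as follows. This is only the conventional exponentially weighted RLS update, not the EM-based $p$-norm-like constrained algorithm the authors propose; the function name `rls_update`, the forgetting factor `lam`, and the list-based matrix representation are assumptions made for this illustration.

```python
def rls_update(w, P, x, d, lam=0.99):
    """One step of conventional recursive least squares.

    w   : current tap-weight vector (list of floats)
    P   : current inverse-correlation matrix (list of lists, symmetric)
    x   : input regressor vector for this step
    d   : desired (observed) output for this step
    lam : forgetting factor, 0 < lam <= 1 (illustrative default)
    """
    n = len(w)

    # P x, reused for both the gain vector and the P update.
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]

    # Gain vector k = P x / (lam + x^T P x).
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    k = [v / denom for v in Px]

    # A priori estimation error with the old weights.
    e = d - sum(w[i] * x[i] for i in range(n))

    # Weight update and inverse-correlation update.
    # Because P is symmetric, x^T P equals (P x)^T.
    w_new = [w[i] + k[i] * e for i in range(n)]
    P_new = [[(P[i][j] - k[i] * Px[j]) / lam for j in range(n)]
             for i in range(n)]
    return w_new, P_new
```

Starting from zero weights and a large diagonal `P`, repeated calls with input/output pairs from a sparse system drive `w` toward the true taps; a sparsity-aware variant such as the one in the abstract would additionally shrink or threshold the small taps at each step.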
-
Snake Robot with Tactile Perception Navigates on Large-scale Challenging Terrain
Authors:
Shuo Jiang,
Adarsh Salagame,
Alireza Ramezani,
Lawson Wong
Abstract:
Along with the advancement of robot skin technology, there has been notable progress in the development of snake robots featuring body-surface tactile perception. In this study, we proposed a locomotion control framework for snake robots that integrates tactile perception to augment their adaptability to various terrains. Our approach embraces a hierarchical reinforcement learning (HRL) architecture, wherein the high-level orchestrates global navigation strategies while the low-level uses curriculum learning for local navigation maneuvers. Due to the significant computational demands of collision detection in whole-body tactile sensing, the efficiency of the simulator is severely compromised. Thus, a distributed training pattern was adopted to mitigate the efficiency reduction. We evaluated the navigation performance of the snake robot in complex large-scale cave exploration with challenging terrains to exhibit improvements in motion efficiency, evidencing the efficacy of tactile perception in terrain-adaptive locomotion of snake robots.
Submitted 5 December, 2023;
originally announced December 2023.
-
Hierarchical RL-Guided Large-scale Navigation of a Snake Robot
Authors:
Shuo Jiang,
Adarsh Salagame,
Alireza Ramezani,
Lawson Wong
Abstract:
Classical snake robot control leverages mimicking snake-like gaits tuned for specific environments. However, to operate adaptively in unstructured environments, gait generation must be dynamically scheduled. In this work, we present a four-layer hierarchical control scheme to enable the snake robot to navigate freely in large-scale environments. The proposed model decomposes navigation into global planning, local planning, gait generation, and gait tracking. Using reinforcement learning (RL) and a central pattern generator (CPG), our method learns to navigate in complex mazes within hours and can be directly deployed to arbitrary new environments in a zero-shot fashion. We use the high-fidelity model of Northeastern's slithering robot COBRA to test the effectiveness of the proposed hierarchical control approach.
Submitted 5 December, 2023;
originally announced December 2023.
-
GlanceSeg: Real-time microaneurysm lesion segmentation with gaze-map-guided foundation model for early detection of diabetic retinopathy
Authors:
Hongyang Jiang,
Mengdi Gao,
Zirong Liu,
Chen Tang,
Xiaoqing Zhang,
Shuai Jiang,
Wu Yuan,
Jiang Liu
Abstract:
Early-stage diabetic retinopathy (DR) presents challenges in clinical diagnosis due to inconspicuous and minute microangioma lesions, resulting in limited research in this area. Additionally, the potential of emerging foundation models, such as the segment anything model (SAM), in medical scenarios remains rarely explored. In this work, we propose a human-in-the-loop, label-free early DR diagnosis framework called GlanceSeg, based on SAM. GlanceSeg enables real-time segmentation of microangioma lesions as ophthalmologists review fundus images. Our human-in-the-loop framework integrates the ophthalmologist's gaze map, allowing for rough localization of minute lesions in fundus images. Subsequently, a saliency map is generated based on the located region of interest, which provides prompt points to assist the foundation model in efficiently segmenting microangioma lesions. Finally, a domain knowledge filter refines the segmentation of minute lesions. We conducted experiments on two newly-built public datasets, i.e., IDRiD and Retinal-Lesions, and validated the feasibility and superiority of GlanceSeg through visualized illustrations and quantitative measures. Additionally, we demonstrated that GlanceSeg improves annotation efficiency for clinicians and enhances segmentation performance through fine-tuning using annotations. This study highlights the potential of GlanceSeg-based annotations for self-model optimization, leading to enduring performance advancements through continual learning.
Submitted 14 November, 2023;
originally announced November 2023.
-
Dual Pipeline Style Transfer with Input Distribution Differentiation
Authors:
ShiQi Jiang,
JunJie Kang,
YuJian Li
Abstract:
The color and texture dual pipeline architecture (CTDP) suppresses texture representation and artifacts through masked total variation loss (Mtv), and further experiments have shown that smooth input can almost completely eliminate texture representation. We have demonstrated through experiments that smooth input is not the key reason for removing texture representations, but rather the distribution differentiation of the training dataset. Based on this, we propose an input distribution differentiation training strategy (IDD), which forces the generation of textures to be completely dependent on the noise distribution, while the smooth distribution will not produce textures at all. Overall, our proposed distribution differentiation training strategy allows for two pre-defined input distributions to be responsible for two generation tasks, with noise distribution responsible for texture generation and smooth distribution responsible for color smooth transfer. Finally, we choose a smooth distribution as the input for the forward inference stage to completely eliminate texture representations and artifacts in color transfer tasks.
Submitted 9 November, 2023;
originally announced November 2023.
-
DEFN: Dual-Encoder Fourier Group Harmonics Network for Three-Dimensional Indistinct-Boundary Object Segmentation
Authors:
Xiaohua Jiang,
Yihao Guo,
Jian Huang,
Yuting Wu,
Meiyi Luo,
Zhaoyang Xu,
Qianni Zhang,
Xingru Huang,
Hong He,
Shaowei Jiang,
Jing Ye,
Mang Xiao
Abstract:
The precise spatial and quantitative delineation of indistinct-boundary medical objects is paramount for the accuracy of diagnostic protocols, efficacy of surgical interventions, and reliability of postoperative assessments. Despite their significance, the effective segmentation and instantaneous three-dimensional reconstruction are significantly impeded by the paucity of representative samples in available datasets and noise artifacts. To surmount these challenges, we introduced Stochastic Defect Injection (SDi) to augment the representational diversity of challenging indistinct-boundary objects within training corpora. Consequently, we propose the Dual-Encoder Fourier Group Harmonics Network (DEFN) to tailor noise filtration, amplify detailed feature recognition, and bolster representation across diverse medical imaging scenarios. By incorporating a Dynamic Weight Composing (DWC) loss that dynamically adjusts the model's focus based on training progression, DEFN achieves SOTA performance on the OIMHS public dataset, showcasing effectiveness in indistinct boundary contexts. Source code for DEFN is available at: https://github.com/IMOP-lab/DEFN-pytorch.
Submitted 19 June, 2024; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Assessing and Enhancing Robustness of Deep Learning Models with Corruption Emulation in Digital Pathology
Authors:
Peixiang Huang,
Songtao Zhang,
Yulu Gan,
Rui Xu,
Rongqi Zhu,
Wenkang Qin,
Limei Guo,
Shan Jiang,
Lin Luo
Abstract:
Deep learning in digital pathology brings intelligence and automation as substantial enhancements to pathological analysis, the gold standard of clinical diagnosis. However, multiple steps from tissue preparation to slide imaging introduce various image corruptions, making it difficult for deep neural network (DNN) models to achieve stable diagnostic results for clinical use. In order to assess and further enhance the robustness of the models, we analyze the physical causes of the full-stack corruptions throughout the pathological life-cycle and propose an Omni-Corruption Emulation (OmniCE) method to reproduce 21 types of corruptions quantified with 5-level severity. We then construct three OmniCE-corrupted benchmark datasets at both patch level and slide level and assess the robustness of popular DNNs in classification and segmentation tasks. Further, we explore using the OmniCE-corrupted datasets as augmentation data for training, and experiments verify that the generalization ability of the models is significantly enhanced.
Submitted 31 October, 2023;
originally announced October 2023.
-
Automatic nodule identification and differentiation in ultrasound videos to facilitate per-nodule examination
Authors:
Siyuan Jiang,
Yan Ding,
Yuling Wang,
Lei Xu,
Wenli Dai,
Wanru Chang,
Jianfeng Zhang,
Jie Yu,
Jianqiao Zhou,
Chunquan Zhang,
Ping Liang,
Dexing Kong
Abstract:
Ultrasound is a vital diagnostic technique in health screening, with the advantages of being non-invasive, cost-effective, and radiation-free, and is therefore widely applied in the diagnosis of nodules. However, it relies heavily on the expertise and clinical experience of the sonographer. In ultrasound images, a single nodule might present heterogeneous appearances in different cross-sectional views, which makes it hard to perform per-nodule examination. Sonographers usually discriminate different nodules by examining the nodule features and the surrounding structures like gland and duct, which is cumbersome and time-consuming. To address this problem, we collected hundreds of breast ultrasound videos and built a nodule reidentification system that consists of two parts: an extractor based on the deep learning model that can extract feature vectors from the input video clips and a real-time clustering algorithm that automatically groups feature vectors by nodules. The system obtains satisfactory results and exhibits the capability to differentiate ultrasound videos. As far as we know, it is the first attempt to apply the re-identification technique in the ultrasonic field.
Submitted 10 October, 2023;
originally announced October 2023.
-
VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence
Authors:
Jianing Qiu,
Jian Wu,
Hao Wei,
Peilun Shi,
Minqing Zhang,
Yunyun Sun,
Lin Li,
Hanruo Liu,
Hongyi Liu,
Simeng Hou,
Yuyang Zhao,
Xuehui Shi,
Junfang Xian,
Xiaoxia Qu,
Sirui Zhu,
Lijie Pan,
Xiaoniao Chen,
Xiaojia Zhang,
Shuai Jiang,
Kebing Wang,
Chenlong Yang,
Mingqiang Chen,
Sujie Fan,
Jianhua Hu,
Aiguo Lv
, et al. (17 additional authors not shown)
Abstract:
We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassification of disease phenotype, and systemic biomarker and disease prediction, with each application enhanced with expert-level intelligence and accuracy. The generalist intelligence of VisionFM outperformed ophthalmologists with basic and intermediate levels in jointly diagnosing 12 common ophthalmic diseases. Evaluated on a new large-scale ophthalmic disease diagnosis benchmark database, as well as a new large-scale segmentation and detection benchmark database, VisionFM outperformed strong baseline deep neural networks. The ophthalmic image representations learned by VisionFM exhibited noteworthy explainability, and demonstrated strong generalizability to new ophthalmic modalities, disease spectrum, and imaging devices. As a foundation model, VisionFM has a large capacity to learn from diverse ophthalmic imaging data and disparate datasets. To be commensurate with this capacity, in addition to the real data used for pre-training, we also generated and leveraged synthetic ophthalmic imaging data. Experimental results revealed that synthetic data that passed visual Turing tests, can also enhance the representation learning capability of VisionFM, leading to substantial performance gains on downstream ophthalmic AI tasks. Beyond the ophthalmic AI applications developed, validated, and demonstrated in this work, substantial further applications can be achieved in an efficient and cost-effective manner using VisionFM as the foundation.
Submitted 7 October, 2023;
originally announced October 2023.
-
Lightweight texture transfer based on texture feature preset
Authors:
ShiQi Jiang
Abstract:
In the task of texture transfer, reference texture images typically exhibit highly repetitive texture features, and the texture transfer results from different content images under the same style also share remarkably similar texture patterns. Encoding such highly similar texture features often requires deep layers and a large number of channels, making it is also the main source of the entire mod…
▽ More
In the task of texture transfer, reference texture images typically exhibit highly repetitive texture features, and the texture transfer results from different content images under the same style also share remarkably similar texture patterns. Encoding such highly similar texture features often requires deep layers and a large number of channels, making it is also the main source of the entire model's parameter count and computational load, and inference time. We propose a lightweight texture transfer based on texture feature preset (TFP). TFP takes full advantage of the high repetitiveness of texture features by providing preset universal texture feature maps for a given style. These preset feature maps can be fused and decoded directly with shallow color transfer feature maps of any content to generate texture transfer results, thereby avoiding redundant texture information from being encoded repeatedly. The texture feature map we preset is encoded through noise input images with consistent distribution (standard normal distribution). This consistent input distribution can completely avoid the problem of texture transfer differentiation, and by randomly sampling different noise inputs, we can obtain different texture features and texture transfer results under the same reference style. Compared to state-of-the-art techniques, our TFP not only produces visually superior results but also reduces the model size by 3.2-3538 times and speeds up the process by 1.8-5.6 times.
Submitted 1 January, 2024; v1 submitted 29 June, 2023;
originally announced June 2023.
-
Vision Guided MIMO Radar Beamforming for Enhanced Vital Signs Detection in Crowds
Authors:
Shuaifeng Jiang,
Ahmed Alkhateeb,
Daniel W. Bliss,
Yu Rong
Abstract:
Radar as a remote sensing technology has been used to analyze human activity for decades. Despite its strengths, such as motion sensitivity, privacy preservation, and penetrability, radar has limited spatial degrees of freedom compared to optical sensors, which makes it challenging to sense crowded environments without prior information. In this paper, we develop a novel dual-sensing system in which a vision sensor is leveraged to guide digital beamforming in a multiple-input multiple-output (MIMO) radar. We also develop a calibration algorithm to align the two types of sensors and show that the calibrated dual system achieves a precision of about two centimeters in three-dimensional space within a field of view of $75^\circ$ by $65^\circ$ and for a range of two meters. Finally, we show that the proposed approach is capable of simultaneously detecting the vital signs of a group of closely spaced subjects, sitting and standing, in a cluttered environment, which highlights a promising direction for vital signs detection in realistic environments.
Submitted 18 June, 2023;
originally announced June 2023.
-
Zero-shot Medical Image Translation via Frequency-Guided Diffusion Models
Authors:
Yunxiang Li,
Hua-Chieh Shao,
Xiao Liang,
Liyuan Chen,
Ruiqi Li,
Steve Jiang,
Jing Wang,
You Zhang
Abstract:
Recently, the diffusion model has emerged as a superior generative model that can produce high-quality, realistic images. However, for medical image translation, existing diffusion models are deficient in accurately retaining structural information, since the structural details of source-domain images are lost during the forward diffusion process and cannot be fully recovered through learned reverse diffusion, while the integrity of anatomical structures is extremely important in medical images. For instance, errors in image translation may distort, shift, or even remove structures and tumors, leading to incorrect diagnosis and inadequate treatments. Training and conditioning diffusion models using paired source and target images with matching anatomy can help. However, such paired data are very difficult and costly to obtain, and may also reduce the robustness of the developed model to out-of-distribution testing data. We propose a frequency-guided diffusion model (FGDM) that employs frequency-domain filters to guide the diffusion model for structure-preserving image translation. By design, FGDM allows zero-shot learning, as it can be trained solely on data from the target domain and used directly for source-to-target domain translation without any exposure to the source-domain data during training. We evaluated it on three cone-beam CT (CBCT)-to-CT translation tasks for different anatomical sites, and a cross-institutional MR imaging translation task. FGDM outperformed the state-of-the-art methods (GAN-based, VAE-based, and diffusion-based) in metrics of Fréchet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM), showing its significant advantages in zero-shot medical image translation.
Submitted 27 October, 2023; v1 submitted 5 April, 2023;
originally announced April 2023.