Search | arXiv e-print repository

Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference

Authors: Xu Zhang, Ming Lu, Yan Chen, Zhan Ma

Abstract: In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at https://github.com/NJUVISION/POLC.

Submitted 2 July, 2025; originally announced July 2025.

Comments: International Conference on Multimedia and Expo (ICME), 2025

arXiv:2506.19222 [pdf, ps, other]

Deformable Medical Image Registration with Effective Anatomical Structure Representation and Divide-and-Conquer Network

Authors: Xinke Ma, Yongsheng Pan, Qingjie Zeng, Mengkang Lu, Bolysbek Murat Yerzhanuly, Bazargul Matkerim, Yong Xia

Abstract: Effective representation of Regions of Interest (ROI) and independent alignment of these ROIs can significantly enhance the performance of deformable medical image registration (DMIR). However, current learning-based DMIR methods have limitations. Unsupervised techniques disregard ROI representation and proceed directly with aligning pairs of images, while weakly-supervised methods heavily depend on label constraints to facilitate registration. To address these issues, we introduce a novel ROI-based registration approach named EASR-DCN. Our method represents medical images through effective ROIs and achieves independent alignment of these ROIs without requiring labels. Specifically, we first use a Gaussian mixture model for intensity analysis to represent images using multiple effective ROIs with distinct intensities. Furthermore, we propose a novel Divide-and-Conquer Network (DCN) to process these ROIs through separate channels to learn feature alignments for each ROI. The resultant correspondences are seamlessly integrated to generate a comprehensive displacement vector field. Extensive experiments were performed on three MRI datasets and one CT dataset to showcase the superior accuracy and deformation reduction efficacy of our EASR-DCN. Compared to VoxelMorph, our EASR-DCN achieved improvements of 10.31% in the Dice score for brain MRI, 13.01% for cardiac MRI, and 5.75% for hippocampus MRI, highlighting its promising potential for clinical applications. The code for this work will be released upon acceptance of the paper.
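
The Dice score used as the accuracy metric above can be illustrated with a minimal sketch. The function name `dice_score`, the flat-list mask representation, and the convention for two empty masks are assumptions made for illustration; this is not the evaluation code of EASR-DCN.

```python
def dice_score(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    a = [bool(v) for v in mask_a]
    b = [bool(v) for v in mask_b]
    # Count voxels labelled as foreground in both masks.
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened segmentation masks of a fixed and a warped image.
fixed = [0, 1, 1, 1, 0, 0]
warped = [0, 0, 1, 1, 1, 0]
print(dice_score(fixed, warped))
```

A score of 1 means the two label maps overlap perfectly and 0 means they are disjoint, so a registration method that aligns anatomy better yields a higher Dice score between the warped and fixed segmentations.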

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.18635 [pdf]

Hybrid Single-Pulse and Sawyer-Tower Method for Accurate Transistor Loss Separation in High-Frequency High-Efficiency Power Converters

Authors: Xiaoyang Tian, Mowei Lu, Florin Udrea, Stephan Goetz

Abstract: Accurate measurement of transistor parasitic capacitance and its associated energy losses is critical for evaluating device performance, particularly in high-frequency and high-efficiency power conversion systems. This paper proposes a hybrid single-pulse and Sawyer-Tower test method to analyse switching characteristics of field-effect transistors (FET), which not only eliminates overlap losses but also mitigates the effects of current backflow observed in traditional double-pulse testing. Through a precise loss separation model, it enables an accurate quantification of switching losses and provides a refined understanding of device energy dissipation mechanisms. We validate the hysteresis data and loss separation results through experimental measurements on a 350-W LLC converter, which further offers deeper insights into transistor dynamic behaviour and its dependence on operating conditions. This method is applicable to a wide range of transistors, including emerging SiC and GaN devices, and serves as a valuable tool for device characterization and optimization in power electronics.

Submitted 23 June, 2025; originally announced June 2025.

Comments: 5 pages, 8 figures

arXiv:2505.21838 [pdf, ps, other]

Nonadaptive Output Regulation of Second-Order Nonlinear Uncertain Systems

Authors: Maobin Lu, Martin Guay, Telema Harry, Shimin Wang, Jordan Cooper

Abstract: This paper investigates the robust output regulation problem of second-order nonlinear uncertain systems with an unknown exosystem. Instead of the adaptive control approach, this paper resorts to a robust control methodology to solve the problem and thus avoid the bursting phenomenon. In particular, this paper constructs generic internal models for the steady-state state and input variables of the system. By introducing a coordinate transformation, this paper converts the robust output regulation problem into a nonadaptive stabilization problem of an augmented system composed of the second-order nonlinear uncertain system and the generic internal models. Then, we design the stabilization control law and construct a strict Lyapunov function that guarantees the robustness with respect to unmodeled disturbances. The analysis shows that the output zeroing manifold of the augmented system can be made attractive by the proposed nonadaptive control law, which solves the robust output regulation problem. Finally, we demonstrate the effectiveness of the proposed nonadaptive internal model approach by its application to the control of the Duffing system.

Submitted 27 May, 2025; originally announced May 2025.

Comments: 8 pages, 3 figures

arXiv:2505.08281 [pdf, ps, other]

Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion

Authors: Anle Ke, Xu Zhang, Tong Chen, Ming Lu, Chao Zhou, Jiawen Gu, Zhan Ma

Abstract: Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, we introduce Semantic Residual Coding (SRC) to capture the semantic disparity between the original image and its compressed latent representation. A perceptual fidelity optimizer is further applied for superior reconstruction quality. Additionally, we present the Compression-aware Diffusion Model (CDM), which establishes an optimal alignment between bitrates and diffusion time steps, improving compression-reconstruction synergy. Extensive experiments demonstrate the effectiveness of ResULIC, achieving superior objective and subjective performance compared to state-of-the-art diffusion-based methods with -80.7% and -66.3% BD-rate savings in terms of LPIPS and FID, respectively. Project page is available at https://njuvision.github.io/ResULIC/.

Submitted 13 May, 2025; originally announced May 2025.

Journal ref: ICML 2025

arXiv:2503.21820 [pdf, other]

UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants

Authors: Yide Di, Yun Liao, Hao Zhou, Kaijun Zhu, Qing Duan, Junhui Liu, Mingyu Lu

Abstract: Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of modal images. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modal and those across different modals. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modals and imbalanced modal datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at https://github.com/LiaoYun0x0/UFM.

Submitted 26 March, 2025; originally announced March 2025.

Comments: 34 pages, 13 figures

arXiv:2503.14352 [pdf, other]

doi 10.1109/LRA.2025.3547306

Flying in Highly Dynamic Environments with End-to-end Learning Approach

Authors: Xiyu Fan, Minghao Lu, Bowen Xu, Peng Lu

Abstract: Obstacle avoidance for unmanned aerial vehicles like quadrotors is a popular research topic. Most existing research focuses only on static environments, and obstacle avoidance in environments with multiple dynamic obstacles remains challenging. This paper proposes a novel deep reinforcement learning-based approach for quadrotors to navigate through highly dynamic environments. We propose a lidar data encoder to extract obstacle information from the massive point cloud data produced by the lidar. Multiple frames of historical scans are compressed into a two-dimensional obstacle map while maintaining the required obstacle features. An end-to-end deep neural network is trained to extract the kinematics of dynamic and static obstacles from the obstacle map and to generate acceleration commands that steer the quadrotor away from these obstacles. Our approach combines perception and navigation functions in a single neural network, which can change from a navigating state into a hovering state without mode switching. We also present simulations and real-world experiments to show the effectiveness of our approach while navigating in highly dynamic cluttered environments.

Submitted 18 March, 2025; originally announced March 2025.

Comments: IEEE Robotics and Automation Letters (2025)

arXiv:2503.07667 [pdf, other]

CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models

Authors: Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang

Abstract: Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at https://github.com/DDVD233/climb.

Submitted 20 March, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

arXiv:2503.06226 [pdf, ps, other]

Optimal Output Feedback Learning Control for Discrete-Time Linear Quadratic Regulation

Authors: Kedi Xie, Martin Guay, Shimin Wang, Fang Deng, Maobin Lu

Abstract: This paper studies the linear quadratic regulation (LQR) problem of unknown discrete-time systems via dynamic output feedback learning control. In contrast to the state feedback, the optimality of the dynamic output feedback control for solving the LQR problem requires an implicit condition on the convergence of the state observer. Moreover, due to unknown system matrices and the existence of observer error, it is difficult to analyze the convergence and stability of most existing output feedback learning-based control methods. To tackle these issues, we propose a generalized dynamic output feedback learning control approach with guaranteed convergence, stability, and optimality performance for solving the LQR problem of unknown discrete-time linear systems. In particular, a dynamic output feedback controller is designed to be equivalent to a state feedback controller. This equivalence relationship is an inherent property without requiring convergence of the estimated state by the state observer, which plays a key role in establishing the off-policy learning control approaches. By value iteration and policy iteration schemes, the adaptive dynamic programming based learning control approaches are developed to estimate the optimal feedback control gain. In addition, a model-free stability criterion is provided by finding a nonsingular parameterization matrix, which contributes to establishing a switched iteration scheme. Furthermore, the convergence, stability, and optimality analyses of the proposed output feedback learning control approaches are given. Finally, the theoretical results are validated by two numerical examples.
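
For orientation, the value iteration idea behind discrete-time LQR can be sketched in the simplest model-based setting. This is an assumption-laden illustration, not the paper's method: the paper targets unknown systems with output feedback, whereas the sketch below assumes a known scalar system x[k+1] = a*x[k] + b*u[k] with stage cost q*x^2 + r*u^2, and the function name and parameters are hypothetical.

```python
def scalar_lqr_value_iteration(a, b, q, r, iterations=200):
    """Iterate the scalar discrete-time Riccati recursion from P = 0.

    Returns the value-function weight P and the feedback gain K for u = -K x.
    Assumes a known model (a, b), unlike the learning setting in the abstract.
    """
    p = 0.0
    for _ in range(iterations):
        # One value iteration step on the Riccati equation.
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    gain = a * b * p / (r + b * b * p)
    return p, gain


p, gain = scalar_lqr_value_iteration(a=1.0, b=1.0, q=1.0, r=1.0)
print(p, gain)
```

For a = b = q = r = 1 the fixed point satisfies P^2 = P + 1, so P converges to the golden ratio and the gain to P / (1 + P).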

Submitted 27 May, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

Comments: 16 pages, 5 figures

arXiv:2502.13915 [pdf, other]

Conveniently Identify Coils in Inductive Power Transfer System Using Machine Learning

Authors: Yifan Zhao, Mowei Lu, Ting Chen, Heyuan Li, Xiang Gao, Zhenbin Zhang, Minfan Fu, Stefan M. Goetz

Abstract: High-frequency inductive power transfer (IPT) has garnered significant attention in recent years due to its long transmission distance and high efficiency. The inductance values L and quality factors Q of the transmitting and receiving coils greatly influence the system's operation. Traditional methods involved impedance analyzers or network analyzers for measurement, which required bulky and costly equipment. Moreover, disassembling it for re-measurement is impractical once the product is packaged. Alternatively, simulation software such as HYSS can serve for the identification. Nevertheless, in the case of very high frequencies, the simulation process consumes a significant amount of time due to the skin and proximity effects. More importantly, obtaining parameters through simulation software becomes impractical when the coil design is more complex. This paper employs, for the first time, a machine learning approach for the identification task. We simply input images of the coils and operating frequency into a well-trained model. This method enables rapid identification of the coil's L and Q values anytime and anywhere, without the need for expensive machinery or coil disassembly.

Submitted 19 February, 2025; originally announced February 2025.

Comments: This paper has been accepted at the 2025 IEEE Applied Power Electronics Conference and Exposition (APEC)

arXiv:2502.13880 [pdf, other]

Class E/EF Inductive Power Transfer to Achieve Stable Output under Variable Low Coupling

Authors: Yifan Zhao, Mowei Lu, Heyuan Li, Zhenbin Zhang, Minfan Fu, Stefan M. Goetz

Abstract: This paper develops an inductive power transfer (IPT) system with stable output power based on a Class E/EF inverter. Load-independent design of Class E/EF inverter has recently attracted widespread interest. However, applying this design to IPT systems has proven challenging when the coupling coefficient is weak. To solve this issue, this paper uses an expanded impedance model and substitutes the secondary side's perfect resonance with a detuned design. Therefore, the system can maintain stable output even under a low coupling coefficient. A 400 kHz experimental prototype validates these findings. The experimental results indicate that the output power fluctuation remains within 15% as the coupling coefficient varies from 0.04 to 0.07. The peak power efficiency reaches 91%.

Submitted 19 February, 2025; originally announced February 2025.

Comments: This paper has been accepted at the 2025 IEEE Applied Power Electronics Conference and Exposition (APEC)

arXiv:2502.13395 [pdf]

Unsupervised CP-UNet Framework for Denoising DAS Data with Decay Noise

Authors: Tianye Huang, Aopeng Li, Xiang Li, Jing Zhang, Sijing Xian, Qi Zhang, Mingkong Lu, Guodong Chen, Liangming Xiong, Xiangyun Hu

Abstract: Distributed acoustic sensor (DAS) technology leverages optical fiber cables to detect acoustic signals, providing cost-effective and dense monitoring capabilities. It offers several advantages including resistance to extreme conditions, immunity to electromagnetic interference, and accurate detection. However, DAS typically exhibits a lower signal-to-noise ratio (S/N) compared to geophones and is susceptible to various noise types, such as random noise, erratic noise, level noise, and long-period noise. This reduced S/N can negatively impact data analyses containing inversion and interpretation. While artificial intelligence has demonstrated excellent denoising capabilities, most existing methods rely on supervised learning with labeled data, which imposes stringent requirements on the quality of the labels. To address this issue, we develop a label-free unsupervised learning (UL) network model based on Context-Pyramid-UNet (CP-UNet) to suppress erratic and random noises in DAS data. The CP-UNet utilizes the Context Pyramid Module in the encoding and decoding process to extract features and reconstruct the DAS data. To enhance the connectivity between shallow and deep features, we add a Connected Module (CM) to both the encoding and decoding sections. Layer Normalization (LN) is utilized to replace the commonly employed Batch Normalization (BN), accelerating the convergence of the model and preventing gradient explosion during training. Huber-loss is adopted as our loss function whose parameters are experimentally determined. We apply the network to both the 2-D synthetic and field data. Compared to traditional denoising methods and the latest UL framework, our proposed method demonstrates superior noise reduction performance.
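
The Huber loss mentioned above can be sketched in its standard textbook form. The threshold name `delta` and its default value are assumptions for illustration; the abstract only states that the loss parameters were determined experimentally.

```python
def huber_loss(residual, delta=1.0):
    """Standard Huber loss: quadratic for small residuals, linear for large ones.

    `delta` is a hypothetical threshold; the value used in the paper is not given.
    """
    abs_r = abs(residual)
    if abs_r <= delta:
        # Quadratic region: behaves like half the squared error.
        return 0.5 * residual * residual
    # Linear region: grows like the absolute error, limiting the pull of outliers.
    return delta * (abs_r - 0.5 * delta)


for r in (0.5, 3.0, -3.0):
    print(r, huber_loss(r))
```

The linear tail is what makes the loss less sensitive than mean squared error to large, isolated deviations such as erratic noise spikes.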

Submitted 18 February, 2025; originally announced February 2025.

Comments: 13 pages, 8 figures

arXiv:2502.11729 [pdf, other]

On Quantizing Neural Representation for Variable-Rate Video Coding

Authors: Junqi Shi, Zhujia Chen, Hanfei Li, Qi Zhao, Ming Lu, Tong Chen, Zhan Ma

Abstract: This work introduces NeuroQuant, a novel post-training quantization (PTQ) approach tailored to non-generalized Implicit Neural Representations for variable-rate Video Coding (INR-VC). Unlike existing methods that require extensive weight retraining for each target bitrate, we hypothesize that variable-rate coding can be achieved by adjusting quantization parameters (QPs) of pre-trained weights. Our study reveals that traditional quantization methods, which assume inter-layer independence, are ineffective for non-generalized INR-VC models due to significant dependencies across layers. To address this, we redefine variable-rate INR-VC as a mixed-precision quantization problem and establish a theoretical framework for sensitivity criteria aimed at simplified, fine-grained rate control. Additionally, we propose network-wise calibration and channel-wise quantization strategies to minimize quantization-induced errors, arriving at a unified formula for representation-oriented PTQ calibration. Our experimental evaluations demonstrate that NeuroQuant significantly outperforms existing techniques in varying bitwidth quantization and compression efficiency, accelerating encoding by up to eight times and enabling quantization down to INT2 with minimal reconstruction loss. This work introduces variable-rate INR-VC for the first time and lays a theoretical foundation for future research in rate-distortion optimization, advancing the field of video coding technology. The materials will be available at https://github.com/Eric-qi/NeuroQuant.
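
As a rough illustration of what channel-wise quantization at a chosen bitwidth means, the sketch below applies generic uniform symmetric quantization with one scale per channel. It is not NeuroQuant's calibration procedure; the function name, the max-abs scale rule, and the nested-list weight layout are all assumptions.

```python
def quantize_channelwise(weights, bits):
    """Quantize then dequantize each channel (row) with its own max-abs scale.

    Generic uniform symmetric quantization, shown only to illustrate the idea;
    the calibration described in the abstract is more involved.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed integer grid")
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    dequantized = []
    for channel in weights:
        peak = max(abs(w) for w in channel)
        # Per-channel step size; fall back to 1.0 for an all-zero channel.
        scale = peak / qmax if peak > 0 else 1.0
        codes = [min(qmax, max(qmin, round(w / scale))) for w in channel]
        dequantized.append([c * scale for c in codes])
    return dequantized


# Hypothetical weights: two channels with very different dynamic ranges.
layer = [[0.5, -1.0, 0.25], [10.0, -5.0, 0.0]]
for bits in (8, 4, 2):
    print(bits, quantize_channelwise(layer, bits))
```

Lowering `bits` coarsens the grid and thus lowers the rate at the cost of larger reconstruction error, which is the trade-off a variable-rate scheme controls.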

Submitted 17 February, 2025; originally announced February 2025.

Comments: to be published in ICLR'25

arXiv:2502.04988 [pdf, other]

CMamba: Learned Image Compression with State Space Models

Authors: Zhuojie Wu, Heming Du, Shuyun Wang, Ming Lu, Haiyang Sun, Yandong Guo, Xin Yu

Abstract: Learned Image Compression (LIC) has explored various architectures, such as Convolutional Neural Networks (CNNs) and transformers, in modeling image content distributions in order to achieve compression effectiveness. However, achieving high rate-distortion performance while maintaining low computational complexity (i.e., parameters, FLOPs, and latency) remains challenging. In this paper, we propose a hybrid Convolution and State Space Models (SSMs) based image compression framework, termed CMamba, to achieve superior rate-distortion performance with low computational complexity. Specifically, CMamba introduces two key components: a Content-Adaptive SSM (CA-SSM) module and a Context-Aware Entropy (CAE) module. First, we observed that SSMs excel in modeling overall content but tend to lose high-frequency details. In contrast, CNNs are proficient at capturing local details. Motivated by this, we propose the CA-SSM module that can dynamically fuse global content extracted by SSM blocks and local details captured by CNN blocks in both encoding and decoding stages. As a result, important image content is well preserved during compression. Second, our proposed CAE module is designed to reduce spatial and channel redundancies in latent representations after encoding. Specifically, our CAE leverages SSMs to parameterize the spatial content in latent representations. Benefiting from SSMs, CAE significantly improves spatial compression efficiency while reducing spatial content redundancies. Moreover, along the channel dimension, CAE reduces inter-channel redundancies of latent representations in an autoregressive manner, which can fully exploit prior knowledge from previous channels without sacrificing efficiency. Experimental results demonstrate that CMamba achieves superior rate-distortion performance.

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2501.11263 [pdf, other]

Towards Loss-Resilient Image Coding for Unstable Satellite Networks

Authors: Hongwei Sha, Muchen Dong, Quanyou Luo, Ming Lu, Hao Chen, Zhan Ma

Abstract: Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address it, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality with unpredictable errors. By integrating the Gilbert-Elliott model into the training process, we enhance the model's ability to generalize in real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in terms of compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments. Code is available at https://github.com/NJUVISION/LossResilientLIC.
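
The Gilbert-Elliott model referenced above is a two-state Markov chain (a "good" and a "bad" channel state) that produces bursty packet loss. The sketch below is a minimal simulator under assumed parameter names and values; how the paper wires such loss patterns into training is not described in the abstract and is not shown here.

```python
import random


def gilbert_elliott_losses(n_packets, p_good_to_bad, p_bad_to_good,
                           loss_in_good=0.0, loss_in_bad=1.0, seed=0):
    """Return a list of booleans, True where a packet is lost.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    in_bad_state = False  # start in the good state
    lost = []
    for _ in range(n_packets):
        # Draw the loss event using the loss probability of the current state.
        loss_prob = loss_in_bad if in_bad_state else loss_in_good
        lost.append(rng.random() < loss_prob)
        # Then move between states according to the Markov transition probabilities.
        if in_bad_state:
            if rng.random() < p_bad_to_good:
                in_bad_state = False
        elif rng.random() < p_good_to_bad:
            in_bad_state = True
    return lost


pattern = gilbert_elliott_losses(40, p_good_to_bad=0.1, p_bad_to_good=0.3)
print("".join("x" if is_lost else "." for is_lost in pattern))
```

Because the chain tends to stay in its current state, losses arrive in bursts rather than independently, which is the behavior that distinguishes this model from a simple per-packet loss rate.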

Submitted 19 January, 2025; originally announced January 2025.

Comments: Accepted as a poster presentation at AAAI 2025

arXiv:2501.08825 [pdf, other]

A Multi-modal Intelligent Channel Model for 6G Multi-UAV-to-Multi-Vehicle Communications

Authors: Lu Bai, Mengyuan Lu, Ziwei Huang, Xiang Cheng

Abstract: In this paper, a novel multi-modal intelligent channel model for sixth-generation (6G) multiple-unmanned aerial vehicle (multi-UAV)-to-multi-vehicle communications is proposed. To thoroughly explore the mapping relationship between the physical environment and the electromagnetic space in the complex multi-UAV-to-multi-vehicle scenario, two new parameters, i.e., terrestrial traffic density (TTD) and aerial traffic density (ATD), are developed and a new sensing-communication intelligent integrated dataset is constructed in a suburban scenario under different TTD and ATD conditions. With the aid of sensing data, i.e., light detection and ranging (LiDAR) point clouds, the parameters of static scatterers, terrestrial dynamic scatterers, and aerial dynamic scatterers in the electromagnetic space, e.g., number, distance, angle, and power, are quantified under different TTD and ATD conditions in the physical environment. In the proposed model, the channel non-stationarity and consistency in the time and space domains and the channel non-stationarity in the frequency domain are simultaneously mimicked. The channel statistical properties, such as time-space-frequency correlation function (TSF-CF), time stationary interval (TSI), and Doppler power spectral density (DPSD), are derived and simulated. Simulation results match ray-tracing (RT) results well, which verifies the accuracy of the proposed multi-UAV-to-multi-vehicle channel model.

Submitted 15 January, 2025; originally announced January 2025.

arXiv:2411.19666 [pdf, other]

Multimodal Whole Slide Foundation Model for Pathology

Authors: Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y. Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F. K. Williamson, Bowen Chen, Cristina Almagro-Perez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Daisuke Komura, Akihiro Kawabe, Shumpei Ishikawa, Georg Gerber, Tingying Peng, Long Phi Le, Faisal Mahmood

Abstract: The field of computational pathology has been transformed with recent advances in foundation models that encode histopathology region-of-interests (ROIs) into versatile and transferable feature representations via self-supervised learning (SSL). However, translating these advancements to address complex clinical challenges at the patient and slide level remains constrained by limited clinical data in disease-specific cohorts, especially for rare clinical conditions. We propose TITAN, a multimodal whole slide foundation model pretrained using 335,645 WSIs via visual self-supervised learning and vision-language alignment with corresponding pathology reports and 423,122 synthetic captions generated from a multimodal generative AI copilot for pathology. Without any finetuning or requiring clinical labels, TITAN can extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis. We evaluate TITAN on diverse clinical tasks and find that TITAN outperforms both ROI and slide foundation models across machine learning settings such as linear probing, few-shot and zero-shot classification, rare cancer retrieval and cross-modal retrieval, and pathology report generation.

Submitted 29 November, 2024; originally announced November 2024.

Comments: The code is accessible at https://github.com/mahmoodlab/TITAN

arXiv:2410.23695 [pdf, other]

Parameterized TDOA: Instantaneous TDOA Estimation and Localization for Mobile Targets in a Time-Division Broadcast Positioning System

Authors: Chenxin Tu, Xiaowei Cui, Gang Liu, Sihao Zhao, Mingquan Lu

Abstract: In a time-division broadcast positioning system (TDBPS), localizing mobile targets using classical time difference of arrival (TDOA) methods poses significant challenges. Concurrent TDOA measurements are infeasible because targets receive signals from different anchors and extract their transmission times at different reception times, as well as at varying positions. Traditional TDOA estimation schemes implicitly assume that the target remains stationary during the measurement period, which is impractical for mobile targets exhibiting high dynamics. Existing methods for mobile target localization are mostly specialized, rely on motion modeling, and do not rely on concurrent TDOA measurements. This limits their direct use of the well-established classical TDOA-based localization methods and complicates the entire localization process. In this paper, to obtain concurrent TDOA estimates at any instant out of the sequential measurements for direct use of existing TDOA-based localization methods, we propose a novel TDOA estimation method, termed parameterized TDOA (P-TDOA). By approximating the time-varying TDOA as a polynomial function over a short period, we transform the TDOA estimation problem into a model parameter estimation problem and derive the desired TDOA estimates thereafter. Theoretical analysis shows that, under certain conditions, the proposed P-TDOA method closely approaches the Cramer-Rao Lower Bound (CRLB) for TDOA estimation in concurrent measurement scenarios, despite measurements being obtained sequentially. Extensive numerical simulations validate our theoretical analysis and demonstrate the effectiveness of the proposed method, highlighting substantial improvements over existing approaches across various scenarios.

Submitted 22 March, 2025; v1 submitted 31 October, 2024; originally announced October 2024.

Comments: This manuscript has been accepted for publication in IEEE Internet of Things Journal. The final version will be available at DOI: 10.1109/JIOT.2025.3554528
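
The polynomial-approximation idea in the abstract above can be illustrated with a small sketch. This is not the paper's P-TDOA estimator: it is a simplified stand-in that assumes each anchor's delay measurements vary smoothly enough to be fitted by a low-order polynomial over a short window, fits one polynomial per anchor by ordinary least squares, and differences the two fits at a common instant. The function names (`polyfit_ls`, `polyval`, `tdoa_at`) and the sample data are hypothetical.

```python
from typing import List, Sequence


def polyfit_ls(t: Sequence[float], y: Sequence[float], degree: int) -> List[float]:
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*t + ..."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(t^(i+j)), b[i] = sum(y * t^i).
    a = [[sum(ti ** (i + j) for ti in t) for j in range(n)] for i in range(n)]
    b = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = s / a[i][i]
    return coeffs


def polyval(coeffs: Sequence[float], t: float) -> float:
    """Evaluate c[0] + c[1]*t + c[2]*t^2 + ... at time t."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


def tdoa_at(t_query, times_a, meas_a, times_b, meas_b, degree=2):
    """Difference of the two fitted per-anchor delay models at one instant."""
    ca = polyfit_ls(times_a, meas_a, degree)
    cb = polyfit_ls(times_b, meas_b, degree)
    return polyval(ca, t_query) - polyval(cb, t_query)


# Hypothetical data: anchors A and B are heard in alternating time slots,
# so no two measurements share a reception time.
times_a = [0.0, 0.2, 0.4, 0.6]
times_b = [0.1, 0.3, 0.5, 0.7]
meas_a = [1.0 + 2.0 * t + 0.5 * t * t for t in times_a]
meas_b = [3.0 - 1.0 * t + 0.25 * t * t for t in times_b]
estimate = tdoa_at(0.35, times_a, meas_a, times_b, meas_b)
```

The point of the sketch is only that sequential, non-concurrent samples can yield a difference at an arbitrary common instant once a parametric model is fitted; noise handling, clock effects, and the estimator analysed in the paper are out of scope here.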

arXiv:2410.07277 [pdf, other]

Swin-BERT: A Feature Fusion System designed for Speech-based Alzheimer's Dementia Detection

Authors: Yilin Pan, Yanpei Shi, Yijia Zhang, Mingyu Lu

Abstract: Speech is usually used for constructing an automatic Alzheimer's dementia (AD) detection system, as the acoustic and linguistic abilities show a decline in people living with AD at the early stages. However, speech includes not only AD-related local and global information but also other information unrelated to cognitive status, such as age and gender. In this paper, we propose a speech-based system named Swin-BERT for automatic dementia detection. For the acoustic part, the shifted windows multi-head attention, which was proposed to extract local and global information from images, is used for designing our acoustic-based system. To decouple the effect of age and gender on acoustic feature extraction, they are used as an extra input of the designed acoustic system. For the linguistic part, the rhythm-related information, which varies significantly between people living with and without AD, is removed while transcribing the audio recordings into transcripts. To compensate for the removed rhythm-related information, the character-level transcripts are proposed to be used as the extra input of a word-level BERT-style system. Finally, the Swin-BERT combines the acoustic features learned from our proposed acoustic-based system with our linguistic-based system. The experiments are based on the two datasets provided by the international dementia detection challenges: the ADReSS and ADReSSo. The results show that both the proposed acoustic and linguistic systems can be better or comparable with previous research on the two datasets. Superior results are achieved by the proposed Swin-BERT system on the ADReSS and ADReSSo datasets, which are 85.58% F-score and 87.32% F-score respectively.

Submitted 9 October, 2024; originally announced October 2024.

arXiv:2410.02598 [pdf, other]

High-Efficiency Neural Video Compression via Hierarchical Predictive Learning

Authors: Ming Lu, Zhihao Duan, Wuyang Cong, Dandan Ding, Fengqing Zhu, Zhan Ma

Abstract: The enhanced Deep Hierarchical Video Compression, DHVC 2.0, has been introduced. This single-model neural video codec operates across a broad range of bitrates, delivering not only superior compression performance to representative methods but also impressive complexity efficiency, enabling real-time processing with a significantly smaller memory footprint on standard GPUs. These remarkable advancements stem from the use of hierarchical predictive coding. Each video frame is uniformly transformed into multiscale representations through hierarchical variational autoencoders. For a specific scale's feature representation of a frame, its corresponding latent residual variables are generated by referencing lower-scale spatial features from the same frame and then conditionally entropy-encoded using a probabilistic model whose parameters are predicted using same-scale temporal reference from previous frames and lower-scale spatial reference of the current frame. This feature-space processing operates from the lowest to the highest scale of each frame, completely eliminating the need for the complexity-intensive motion estimation and compensation techniques that have been standard in video codecs for decades. The hierarchical approach facilitates parallel processing, accelerating both encoding and decoding, and supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss. Source codes will be made available.

Submitted 3 October, 2024; originally announced October 2024.

arXiv:2409.19660 [pdf, other]

All-in-One Image Coding for Joint Human-Machine Vision with Multi-Path Aggregation

Authors: Xu Zhang, Peiyao Guo, Ming Lu, Zhan Ma

Abstract: Image coding for multi-task applications, catering to both human perception and machine vision, has been extensively investigated. Existing methods often rely on multiple task-specific encoder-decoder pairs, leading to high overhead of parameter and bitrate usage, or face challenges in multi-objective optimization under a unified representation, failing to achieve both performance and efficiency. To this end, we propose Multi-Path Aggregation (MPA) integrated into existing coding models for joint human-machine vision, unifying the feature representation with an all-in-one architecture. MPA employs a predictor to allocate latent features among task-specific paths based on feature importance varied across tasks, maximizing the utility of shared features while preserving task-specific features for subsequent refinement. Leveraging feature correlations, we develop a two-stage optimization strategy to alleviate multi-task performance degradation. Upon the reuse of shared features, as low as 1.89% parameters are further augmented and fine-tuned for a specific task, which completely avoids extensive optimization of the entire model. Experimental results show that MPA achieves performance comparable to state-of-the-art methods in both task-specific and multi-objective optimization across human viewing and machine analysis tasks. Moreover, our all-in-one design supports seamless transitions between human- and machine-oriented reconstruction, enabling task-controllable interpretation without altering the unified model. Code is available at https://github.com/NJUVISION/MPA.

Submitted 29 September, 2024; originally announced September 2024.

Comments: NeurIPS 2024

arXiv:2409.01009 [pdf, other]

Accelerating block-level rate control for learned image compression

Authors: Muchen Dong, Ming Lu, Zhan Ma

Abstract: Despite the unprecedented compression efficiency achieved by deep learned image compression (LIC), existing methods usually approximate the desired bitrate by adjusting a single quality factor for a given input image, which may compromise the rate control results. Considering the Rate-Distortion (R-D) characteristics of different spatial content, this work introduces block-level rate control based on a novel D-λ model specific to LIC. Furthermore, we exploit the inter-block correlations and propose a block-wise R-D prediction algorithm which greatly speeds up block-level rate control while still guaranteeing high accuracy. Experimental results show that the proposed rate control achieves up to a 100-times speed-up with more than 98% accuracy. Our approach provides an optimal bit allocation for each block and therefore improves the overall compression performance, which offers great potential for block-level LIC.

Submitted 2 September, 2024; originally announced September 2024.

Comments: 10 pages, 5 figures

MSC Class: 68P30; ACM Class: I.4.2

arXiv:2407.21395 [pdf, other]

HINER: Neural Representation for Hyperspectral Image

Authors: Junqi Shi, Mingyi Jiang, Ming Lu, Tong Chen, Xun Cao, Zhan Ma

Abstract: This paper introduces HINER, a novel neural representation for compressing HSI and ensuring high-quality downstream tasks on compressed HSI. HINER fully exploits inter-spectral correlations by explicitly encoding spectral wavelengths and achieves a compact representation of the input HSI sample through joint optimization with a learnable decoder. By additionally incorporating the Content Angle Mapper with the L1 loss, we can supervise the global and local information within each spectral band, thereby enhancing the overall reconstruction quality. For downstream classification on compressed HSI, we theoretically demonstrate that the task accuracy is related not only to the classification loss but also to the reconstruction fidelity through a first-order expansion of the accuracy degradation, and accordingly adapt the reconstruction by introducing Adaptive Spectral Weighting. Owing to the monotonic mapping of HINER between wavelengths and spectral bands, we propose Implicit Spectral Interpolation for data augmentation by adding random variables to input wavelengths during classification model training. Experimental results on various HSI datasets demonstrate the superior compression performance of our HINER compared to the existing learned methods and also the traditional codecs. Our model is lightweight and computationally efficient, and maintains high accuracy for downstream classification tasks even on decoded HSIs at high compression ratios. Our materials will be released at https://github.com/Eric-qi/HINER.

Submitted 31 July, 2024; originally announced July 2024.

Comments: ACM MM24

arXiv:2405.10570 [pdf]

Simultaneous Deep Learning of Myocardium Segmentation and T2 Quantification for Acute Myocardial Infarction MRI

Authors: Yirong Zhou, Chengyan Wang, Mengtian Lu, Kunyuan Guo, Zi Wang, Dan Ruan, Rui Guo, Peijun Zhao, Jianhua Wang, Naiming Wu, Jianzhong Lin, Yinyin Chen, Hang Jin, Lianxin Xie, Lilan Wu, Liuhong Zhu, Jianjun Zhou, Congbo Cai, He Wang, Xiaobo Qu

Abstract: In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features a T2-refine fusion decoder for quantitative analysis, leveraging global features from the Transformer, and a segmentation decoder with multiple local region supervision for enhanced accuracy. A tight coupling module aligns and fuses CNN and Transformer branch features, enabling SQNet to focus on myocardium regions. Evaluation on healthy controls (HC) and acute myocardial infarction patients (AMI) demonstrates superior segmentation dice scores (89.3/89.2) compared to state-of-the-art methods (87.7/87.9). T2 quantification yields strong linear correlations (Pearson coefficients: 0.84/0.93) with label values for HC/AMI, indicating accurate mapping. Radiologist evaluations confirm SQNet's superior image quality scores (4.60/4.58 for segmentation, 4.32/4.42 for T2 quantification) over state-of-the-art methods (4.50/4.44 for segmentation, 3.59/4.37 for T2 quantification). SQNet thus offers accurate simultaneous segmentation and quantification, enhancing cardiac disease diagnosis, such as AMI.

Submitted 29 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

Comments: 10 pages, 8 figures, 6 tables

arXiv:2404.08285 [pdf]

A Survey of Neural Network Robustness Assessment in Image Recognition

Authors: Jie Wang, Jun Ai, Minyan Lu, Haoran Su, Dan Yu, Yutao Zhang, Junda Zhu, Jingyu Liu

Abstract: In recent years, there has been significant attention given to the robustness assessment of neural networks. Robustness plays a critical role in ensuring reliable operation of artificial intelligence (AI) systems in complex and uncertain environments. Deep learning's robustness problem is particularly significant, highlighted by the discovery of adversarial attacks on image classification models. Researchers have dedicated efforts to evaluate robustness in diverse perturbation conditions for image recognition tasks. Robustness assessment encompasses two main techniques: robustness verification/certification for deliberate adversarial attacks and robustness testing for random data corruptions. In this survey, we present a detailed examination of both adversarial robustness (AR) and corruption robustness (CR) in neural network assessment. Analyzing current research papers and standards, we provide an extensive overview of robustness assessment in image recognition. Three essential aspects are analyzed: concepts, metrics, and assessment methods. We investigate the perturbation metrics and range representations used to measure the degree of perturbations on images, as well as the robustness metrics specifically for the robustness conditions of classification models. The strengths and limitations of the existing methods are also discussed, and some potential directions for future research are provided.

Submitted 15 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

Comments: Corrected typos and grammatical errors in Section 5

arXiv:2402.18862 [pdf, other]

Towards Backward-Compatible Continual Learning of Image Compression

Authors: Zhihao Duan, Ming Lu, Justin Yang, Jiangpeng He, Zhan Ma, Fengqing Zhu

Abstract: This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions, such as end-to-end fine-tuning, do not preserve the desired backward compatibility. To tackle this, we propose a knowledge replay training strategy that effectively addresses this issue. We also design a new model architecture that enables more effective continual learning than existing baselines. Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning. The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility. Our code is available at https://gitlab.com/viper-purdue/continual-compression

Submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted to CVPR 2024

arXiv:2402.11164 [pdf]

TinyLIC-High efficiency lossy image compression method

Authors: Gaocheng Ma, Yinfeng Chai, Tianhao Jiang, Ming Lu, Tong Chen

Abstract: Image compression has been the subject of extensive research for several decades, resulting in the development of well-known standards such as JPEG, JPEG2000, and H.264/AVC. However, recent advancements in deep learning have led to the emergence of learned image compression methods that offer significant improvements in coding efficiency compared to traditional codecs. These learned compression techniques have shown noticeable gains and even outperformed traditional schemes

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2401.11615 [pdf, other]

Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding

Authors: Yichi Zhang, Zhihao Duan, Ming Lu, Dandan Ding, Fengqing Zhu, Zhan Ma

Abstract: While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions to pass through the local attention units for inter-cluster embedding. Additionally, we introduce the Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage. Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized using MSE, it outperforms VVC by about 10% BD-Rate in three widely-used benchmark datasets; when optimized using MS-SSIM, it saves more than 50% BD-Rate over VVC. Our CLIC offers a new way to generate compact representations for image compression, which also provides a novel direction along the line of LIC development.

Submitted 21 January, 2024; originally announced January 2024.

Comments: The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

arXiv:2401.06148 [pdf, other]

doi 10.1038/s44222-023-00096-8

Artificial Intelligence for Digital and Computational Pathology

Authors: Andrew H. Song, Guillaume Jaume, Drew F. K. Williamson, Ming Y. Lu, Anurag Vaidya, Tiffany R. Miller, Faisal Mahmood

Abstract: Advances in digitizing tissue slides and the fast-paced progress in artificial intelligence, including deep learning, have boosted the field of computational pathology. This field holds tremendous potential to automate clinical diagnosis, predict patient prognosis and response to therapy, and discover new morphological biomarkers from tissue images. Some of these artificial intelligence-based systems are now getting approved to assist clinical diagnosis; however, technical barriers remain for their widespread clinical adoption and integration as a research tool. This Review consolidates recent methodological advances in computational pathology for predicting clinical end points in whole-slide images and highlights how these developments enable the automation of clinical practice and the discovery of new biomarkers. We then provide future perspectives as the field expands into a broader range of clinical and research tasks with increasingly diverse modalities of clinical data.

Submitted 12 December, 2023; originally announced January 2024.

Journal ref: Nature Reviews Bioengineering 2023

arXiv:2401.04412 [pdf, other]

Deep Covariance Alignment for Domain Adaptive Remote Sensing Image Segmentation

Authors: Linshan Wu, Ming Lu, Leyuan Fang

Abstract: Unsupervised domain adaptive (UDA) image segmentation has recently gained increasing attention, aiming to improve the generalization capability for transferring knowledge from the source domain to the target domain. However, in high spatial resolution remote sensing image (RSI), the same category from different domains (e.g., urban and rural) can appear to be totally different with extremely inconsistent distributions, which heavily limits the UDA accuracy. To address this problem, in this paper, we propose a novel Deep Covariance Alignment (DCA) model for UDA RSI segmentation. The DCA can explicitly align category features to learn shared domain-invariant discriminative feature representations, which enhances the ability of model generalization. Specifically, a Category Feature Pooling (CFP) module is first employed to extract category features by combining the coarse outputs and the deep features. Then, we leverage a novel Covariance Regularization (CR) to enforce the intra-category features to be closer and the inter-category features to be further separate. Compared with the existing category alignment methods, our CR aims to regularize the correlation between different dimensions of the features and thus performs more robustly when dealing with the divergent category features of imbalanced and inconsistent distributions. Finally, we propose a stagewise procedure to train the DCA in order to alleviate the error accumulation. Experiments on both Rural-to-Urban and Urban-to-Rural scenarios of the LoveDA dataset demonstrate the superiority of our proposed DCA over other state-of-the-art UDA segmentation methods. Code is available at https://github.com/Luffy03/DCA.

Submitted 9 January, 2024; originally announced January 2024.

Comments: A paper accepted by TGRS

arXiv:2312.08743 [pdf, other]

FAPP: Fast and Adaptive Perception and Planning for UAVs in Dynamic Cluttered Environments

Authors: Minghao Lu, Xiyu Fan, Han Chen, Peng Lu

Abstract: Obstacle avoidance for Unmanned Aerial Vehicles (UAVs) in cluttered environments is significantly challenging. Existing obstacle avoidance for UAVs either focuses on fully static environments or static environments with only a few dynamic objects. In this paper, we take the initiative to consider the obstacle avoidance of UAVs in dynamic cluttered environments in which dynamic objects are the dominant objects. This type of environment poses significant challenges to both perception and planning. Multiple dynamic objects possess various motions, making it extremely difficult to estimate and predict their motions using one motion model. The planning must be highly efficient to avoid cluttered dynamic objects. This paper proposes Fast and Adaptive Perception and Planning (FAPP) for UAVs flying in complex dynamic cluttered environments. A novel and efficient point cloud segmentation strategy is proposed to distinguish static and dynamic objects. To address multiple dynamic objects with different motions, an adaptive estimation method with covariance adaptation is proposed to quickly and accurately predict their motions. Our proposed trajectory optimization algorithm is highly efficient, enabling it to avoid fast objects. Furthermore, an adaptive re-planning method is proposed to address the case when the trajectory optimization cannot find a feasible solution, which is common for dynamic cluttered environments. Extensive validations in both simulation and real-world experiments demonstrate the effectiveness of our proposed system for highly dynamic and cluttered environments.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2312.07126 [pdf, other]

Deep Hierarchical Video Compression

Authors: Ming Lu, Zhihao Duan, Fengqing Zhu, Zhan Ma

Abstract: Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with much less memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding that is favored in networked video applications with packet loss.

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.01361 [pdf, other]

MoEC: Mixture of Experts Implicit Neural Compression

Authors: Jianchen Zhao, Cheng-Ching Tseng, Ming Lu, Ruichuan An, Xiaobao Wei, He Sun, Shanghang Zhang

Abstract: Emerging Implicit Neural Representation (INR) is a promising data compression technique, which represents the data using the parameters of a Deep Neural Network (DNN). Existing methods manually partition a complex scene into local regions and overfit the INRs into those regions. However, manually designing the partition scheme for a complex scene is very challenging and fails to jointly learn the partition and INRs. To solve the problem, we propose MoEC, a novel implicit neural compression method based on the theory of mixture of experts. Specifically, we use a gating network to automatically assign a specific INR to a 3D point in the scene. The gating network is trained jointly with the INRs of different local regions. Compared with block-wise and tree-structured partitions, our learnable partition can adaptively find the optimal partition in an end-to-end manner. We conduct detailed experiments on massive and diverse biomedical data to demonstrate the advantages of MoEC against existing approaches. In most experimental settings, we achieve state-of-the-art results. In particular, at extreme compression ratios such as 6000x, we maintain a PSNR of 48.16.

Submitted 3 December, 2023; originally announced December 2023.

arXiv:2311.16572

Adapting to climate change: Long-term impact of wind resource changes on China's power system resilience

Authors: Jiaqi Ruan, Xiangrui Meng, Yifan Zhu, Gaoqi Liang, Xianzhuo Sun, Huayi Wu, Huijuan Xiao, Mengqian Lu, Pin Gao, Jiapeng Li, Wai-Kin Wong, Zhao Xu, Junhua Zhao

Abstract: Modern society's reliance on power systems is at risk from the escalating effects of wind-related climate change. Yet, failure to identify the intricate relationship between wind-related climate risks and power systems could lead to serious short- and long-term issues, including partial or complete blackouts. Here, we develop a comprehensive framework to assess China's power system resilience across various climate change scenarios, enabling a holistic evaluation of the repercussions induced by wind-related climate change. Our findings indicate that China's current wind projects and planning strategies could be jeopardized by wind-related climate change, with up to a 12% decline in regional wind power availability. Moreover, our results underscore a pronounced vulnerability of power system resilience amidst the rigors of hastened climate change, unveiling a potential amplification of resilience deterioration, even approaching fourfold by 2060 under the most severe scenario, relative to the 2020 benchmark. This work advocates for strategic financial deployment within the power sector aimed at climate adaptation, enhancing power system resilience to avert profound losses from long-term, wind-influenced climatic fluctuations.

Submitted 24 January, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: Not suitable for publication

arXiv:2311.16565 [pdf, other]

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

Authors: Peng Chen, Xiaobao Wei, Ming Lu, Yitong Zhu, Naiming Yao, Xingyu Xiao, Hui Chen

Abstract: Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches have started to consider the non-deterministic nature of speech-driven 3D face animation and employ the diffusion model for the task. However, personalizing facial animation and accelerating animation generation are still two major limitations of existing diffusion-based methods. To address the above limitations, we propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity to aggregate knowledge in audio sequences. The proposed identity embeddings extract customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation based on input audio, reflecting a specific talking style. For acceleration, we distill a trained diffusion model with hundreds of steps into a lightweight model with 8 steps. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released.

Submitted 2 December, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.02035 [pdf, other]

A Highly-Compact Direct-Injection Universal Power Flow and Quality Control Circuit

Authors: Mowei Lu, Mengjie Qin, Jan Kacetl, Eeshta Suresh, Teng Long, Stefan M. Goetz

Abstract: This paper presents a novel direct-injection modular universal power flow and quality control topology exclusively using lower power components. In addition to conventional high-voltage applications, it is particularly attractive for the distribution and secondary grids, e.g., in soft open points, down to low voltage as it can exploit the latest developments in low-voltage high-current semiconductors. In contrast to other concepts that do not interface the grid through transformers, it does not need to convert the entire line power but only the injected or extracted power difference. The proposed power flow and quality (f/q) controller comprises a shunt active front end, together with high-frequency links serving as a power supply for a series floating module per phase. Each of the floating modules is in series with one phase of the line, floating with the electric potential of that particular phase, avoiding any ground connection. Omitting bulky and dynamically limited line transformers of conventional universal power flow controllers, the presented direct-injection f/q controller enables exceptionally small size and volume, high power density, high frequency content, and fast response. In contrast to direct-injection concepts with full back-to-back converters, it only needs to handle a fraction of the power. The circuit combines grid-voltage low-current electronics in the shunt unit and low-voltage high-current modules in the floating series injection units. Simulations and experiments demonstrate and validate the concept.

Submitted 3 November, 2023; originally announced November 2023.

arXiv:2310.08292 [pdf, other]

Concealed Electronic Countermeasures of Radar Signal with Adversarial Examples

Authors: Ruinan Ma, Canjie Zhu, Mingfeng Lu, Yunjie Li, Yu-an Tan, Ruibin Zhang, Ran Tao

Abstract: Electronic countermeasures involving radar signals are an important aspect of modern warfare. Traditional electronic countermeasure techniques typically add large-scale interference signals to ensure interference effects, which can make attacks too obvious. In recent years, AI-based attack methods have emerged that can effectively solve this problem, but the attack scenarios are currently limited to time-domain radar signal classification. In this paper, we focus on the time-frequency image classification scenario of radar signals. We first propose an attack pipeline for the time-frequency image scenario and the DITIMI-FGSM attack algorithm with high transferability. Then, we propose the STFT-based time domain signal attack (STDS) algorithm to solve the problem of non-invertibility in time-frequency analysis, thus obtaining the time-domain representation of the interference signal. A large number of experiments show that our attack pipeline is feasible and the proposed attack method has a high success rate.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2310.08068 [pdf, other]

Frequency-Aware Re-Parameterization for Over-Fitting Based Image Compression

Authors: Yun Ye, Yanjie Pan, Qually Jiang, Ming Lu, Xiaoran Fang, Beryl Xu

Abstract: Over-fitting-based image compression requires weight compactness for compression and fast convergence for practical use, posing challenges for methods based on deep convolutional neural networks (CNNs). This paper presents a simple re-parameterization method to train CNNs with reduced weight storage and accelerated convergence. The convolution kernels are re-parameterized as a weighted sum of discrete cosine transform (DCT) kernels, enabling direct optimization in the frequency domain. Combined with L1 regularization, the proposed method surpasses vanilla convolutions by achieving significantly improved rate-distortion performance at low computational cost. The proposed method is verified with extensive experiments of over-fitting-based image restoration on various datasets, achieving up to -46.12% BD-rate on top of HEIF with only 200 iterations.
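To illustrate what "a weighted sum of DCT kernels" can mean in practice, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `dct_basis`, `kernel_from_coeffs`, and `coeffs_from_kernel` are hypothetical, and the choice of an orthonormal 2D DCT-II basis is an assumption made for the example. A k x k spatial kernel is built from k x k frequency-domain coefficients, which would be the trainable parameters in such a scheme.

```python
import numpy as np

def dct_basis(k):
    """Return the k*k orthonormal 2D DCT-II basis kernels, shape (k, k, k, k).

    basis[u, v] is the k x k spatial kernel for frequency pair (u, v).
    """
    n = np.arange(k)
    # 1D DCT-II matrix: rows index frequency, columns index spatial position.
    c = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * k))
    c[0, :] *= np.sqrt(1.0 / k)   # scaling of the DC row
    c[1:, :] *= np.sqrt(2.0 / k)  # scaling of the remaining rows
    # 2D basis kernels are outer products of 1D basis vectors.
    return np.einsum("ui,vj->uvij", c, c)

def kernel_from_coeffs(coeffs):
    """Build a spatial kernel as a weighted sum of DCT basis kernels."""
    k = coeffs.shape[0]
    return np.einsum("uv,uvij->ij", coeffs, dct_basis(k))

def coeffs_from_kernel(kernel):
    """Project a spatial kernel back onto the orthonormal DCT basis."""
    k = kernel.shape[0]
    return np.einsum("ij,uvij->uv", kernel, dct_basis(k))
```

Because the basis in this sketch is orthonormal, the mapping between coefficients and kernels is invertible, so optimizing the coefficients is equivalent to optimizing the kernel in the frequency domain. A sparsity-inducing penalty such as L1 on the coefficients would then zero out individual frequency components; whether the paper applies its L1 term in exactly this way is not stated in the abstract.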

Submitted 12 October, 2023; originally announced October 2023.

Comments: To be published at ICIP 2023; this version fixes a mistake in Eq. (1) of the proceedings version

arXiv:2310.06162 [pdf]

Empirical Evaluation of the Segment Anything Model (SAM) for Brain Tumor Segmentation

Authors: Mohammad Peivandi, Jason Zhang, Michael Lu, Dongxiao Zhu, Zhifeng Kou

Abstract: Brain tumor segmentation presents a formidable challenge in the field of Medical Image Segmentation. While deep-learning models have been useful, human expert segmentation remains the most accurate method. The recently released Segment Anything Model (SAM) has opened up the opportunity to apply foundation models to this difficult task. However, SAM was primarily trained on diverse natural images. This makes applying SAM to biomedical segmentation, such as brain tumors with less defined boundaries, challenging. In this paper, we enhanced SAM's mask decoder using transfer learning with the Decathlon brain tumor dataset. We developed three methods to encapsulate the four-dimensional data into three dimensions for SAM. An on-the-fly data augmentation approach has been used with a combination of rotations and elastic deformations to increase the size of the training dataset. Two key metrics, the Dice Similarity Coefficient (DSC) and the Hausdorff Distance 95th Percentile (HD95), have been applied to assess the performance of our segmentation models. These metrics provided valuable insights into the quality of the segmentation results. In our evaluation, we compared this improved model to two benchmarks: the pretrained SAM and the widely used model, nnUNetv2. We find that the improved SAM shows considerable improvement over the pretrained SAM, while nnUNetv2 outperformed the improved SAM in terms of overall segmentation accuracy. Nevertheless, the improved SAM demonstrated slightly more consistent results than nnUNetv2, especially on challenging cases that can lead to larger Hausdorff distances. In the future, more advanced techniques can be applied in order to further improve the performance of SAM on brain tumor segmentation.
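For readers unfamiliar with the Dice Similarity Coefficient mentioned in this abstract, the following is a minimal sketch of its standard definition, DSC = 2|A ∩ B| / (|A| + |B|), for two binary masks. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

A score of 1 means the predicted and reference masks coincide, and 0 means they do not overlap at all. HD95, the other metric named in the abstract, measures boundary distance rather than overlap and is not covered by this sketch.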

Submitted 9 October, 2023; originally announced October 2023.

arXiv:2309.06421 [pdf, other]

AGMDT: Virtual Staining of Renal Histology Images with Adjacency-Guided Multi-Domain Transfer

Authors: Tao Ma, Chao Zhang, Min Lu, Lin Luo

Abstract: Renal pathology, as the gold standard of kidney disease diagnosis, requires doctors to analyze a series of tissue slices stained by H&E staining and special staining like Masson, PASM, and PAS, respectively. These special staining methods are costly, time-consuming, and hard to standardize for wide use, especially in primary hospitals. Advances in supervised learning methods have enabled the virtual conversion of H&E images into special staining images, but achieving pixel-to-pixel alignment for training remains challenging. In contrast, unsupervised learning methods that regard different stains as different style transfer domains can utilize unpaired data, but they ignore the spatial inter-domain correlations and thus decrease the trustworthiness of structural details for diagnosis. In this paper, we propose a novel virtual staining framework AGMDT to translate images into other domains by avoiding pixel-level alignment and meanwhile utilizing the correlations among adjacent tissue slices. We first build a high-quality multi-domain renal histological dataset where each specimen case comprises a series of slices stained in various ways. Based on it, the proposed framework AGMDT discovers patch-level aligned pairs across the serial slices of multi-domains through glomerulus detection and bipartite graph matching, and utilizes such correlations to supervise the end-to-end model for multi-domain staining transformation. Experimental results show that the proposed AGMDT achieves a good balance between the precise pixel-level alignment and unpaired domain transfer by exploiting correlations across multi-domain serial pathological slices, and outperforms the state-of-the-art methods in both quantitative measure and morphological details.

Submitted 17 September, 2023; v1 submitted 12 September, 2023; originally announced September 2023.

Comments: BMVC 2023

arXiv:2308.15144

TKwinFormer: Top k Window Attention in Vision Transformers for Feature Matching

Authors: Yun Liao, Yide Di, Hao Zhou, Kaijun Zhu, Mingyu Lu, Yijia Zhang, Qing Duan, Junhui Liu

Abstract: Local feature matching remains a challenging task, primarily due to difficulties in matching sparse keypoints and low-texture regions. The key to solving this problem lies in effectively and accurately integrating global and local information. To achieve this goal, we introduce an innovative local feature matching method called TKwinFormer. Our approach employs a multi-stage matching strategy to optimize the efficiency of information interaction. Furthermore, we propose a novel attention mechanism called Top K Window Attention, which facilitates global information interaction through window tokens prior to patch-level matching, resulting in improved matching accuracy. Additionally, we design an attention block to enhance attention between channels. Experimental results demonstrate that TKwinFormer outperforms state-of-the-art methods on various benchmarks. Code is available at: https://github.com/LiaoYun0x0/TKwinFormer.

Submitted 30 March, 2025; v1 submitted 29 August, 2023; originally announced August 2023.

Comments: After careful reconsideration, we have decided to withdraw the manuscript due to data inconsistencies and issues with methodology. Given these concerns, we believe it would be inappropriate to proceed with the revised version, and we have therefore decided to retract our submission

ACM Class: I.4.7

arXiv:2306.08955 [pdf, other]

A Comparison of Self-Supervised Pretraining Approaches for Predicting Disease Risk from Chest Radiograph Images

Authors: Yanru Chen, Michael T Lu, Vineet K Raghu

Abstract: Deep learning is the state of the art for medical imaging tasks, but requires large, labeled datasets. For risk prediction, large datasets are rare since they require both imaging and follow-up (e.g., diagnosis codes). However, the release of publicly available imaging data with diagnostic labels presents an opportunity for self- and semi-supervised approaches to improve label efficiency for risk prediction. Though several studies have compared self-supervised approaches in natural image classification, object detection, and medical image interpretation, there is limited data on which approaches learn robust representations for risk prediction. We present a comparison of semi- and self-supervised learning to predict mortality risk using chest x-ray images. We find that a semi-supervised autoencoder outperforms contrastive and transfer learning in internal and external validation.

Submitted 15 June, 2023; originally announced June 2023.

Comments: 33 pages, 22 figures, Accepted for publication at MIDL 2023

arXiv:2306.05196 [pdf, other]

Channel prior convolutional attention for medical image segmentation

Authors: Hejun Huang, Zuguo Chen, Ying Zou, Ming Lu, Chaoyang Chen

Abstract: Characteristics such as low contrast and significant organ shape variations are often exhibited in medical images. The improvement of segmentation performance in medical imaging is limited by the generally insufficient adaptive capabilities of existing attention mechanisms. An efficient Channel Prior Convolutional Attention (CPCA) method is proposed in this paper, supporting the dynamic distribution of attention weights in both channel and spatial dimensions. Spatial relationships are effectively extracted while preserving the channel prior by employing a multi-scale depth-wise convolutional module. The ability to focus on informative channels and important regions is possessed by CPCA. A segmentation network called CPCANet for medical image segmentation is proposed based on CPCA. CPCANet is validated on two publicly available datasets. Improved segmentation performance is achieved by CPCANet while requiring fewer computational resources through comparisons with state-of-the-art algorithms. Our code is publicly available at https://github.com/Cuthbert-Huang/CPCANet.

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2304.06497 [pdf, other]

A Comprehensive Comparison of Projections in Omnidirectional Super-Resolution

Authors: Huicheng Pi, Senmao Tian, Ming Lu, Jiaming Liu, Yandong Guo, Shunli Zhang

Abstract: Super-Resolution (SR) has gained increasing research attention over the past few years. With the development of Deep Neural Networks (DNNs), many super-resolution methods based on DNNs have been proposed. Although most of these methods are aimed at ordinary frames, there are few works on super-resolution of omnidirectional frames. In these works, omnidirectional frames are projected from the 3D sphere to a 2D plane by Equi-Rectangular Projection (ERP). Although ERP has been widely used for projection, it has severe projection distortion near the poles. Current DNN-based SR methods use 2D convolution modules, which are more suitable for regular grids. In this paper, we find that different projection methods have a great impact on the performance of DNNs. To study this problem, a comprehensive comparison of projections in omnidirectional super-resolution is conducted. We compare the SR results of different projection methods. Experimental results show that Equi-Angular cube map projection (EAC), which has minimal distortion, achieves the best result in terms of WS-PSNR compared with other projections. Code and data will be released.

Submitted 13 April, 2023; originally announced April 2023.

Comments: Accepted to ICASSP2023

arXiv:2302.08899 [pdf, other]

doi 10.1109/TPAMI.2023.3322904

QARV: Quantization-Aware ResNet VAE for Lossy Image Compression

Authors: Zhihao Duan, Ming Lu, Jack Ma, Yuning Huang, Zhan Ma, Fengqing Zhu

Abstract: This paper addresses the problem of lossy image compression, a fundamental problem in image processing and information theory that is involved in many real-world applications. We start by reviewing the framework of variational autoencoders (VAEs), a powerful class of generative probabilistic models that has a deep connection to lossy compression. Based on VAEs, we develop a novel scheme for lossy image compression, which we name quantization-aware ResNet VAE (QARV). Our method incorporates a hierarchical VAE architecture integrated with test-time quantization and quantization-aware training, without which efficient entropy coding would not be possible. In addition, we design the neural network architecture of QARV specifically for fast decoding and propose an adaptive normalization operation for variable-rate compression. Extensive experiments are conducted, and results show that QARV achieves variable-rate compression, high-speed decoding, and a better rate-distortion performance than existing baseline methods. The code of our method is publicly accessible at https://github.com/duanzhiihao/lossy-vae

Submitted 1 December, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: Full version (19 pages, includes appendix) of the paper accepted by IEEE TPAMI

arXiv:2212.07778 [pdf, other]

doi 10.1109/TPAMI.2024.3359326

Efficient Visual Computing with Camera RAW Snapshots

Authors: Zhihao Li, Ming Lu, Xu Zhang, Xin Feng, M. Salman Asif, Zhan Ma

Abstract: Conventional cameras capture image irradiance on a sensor and convert it to RGB images using an image signal processor (ISP). The images can then be used for photography or visual computing tasks in a variety of applications, such as public safety surveillance and autonomous driving. One can argue that since RAW images contain all the captured information, the conversion of RAW to RGB using an ISP is not necessary for visual computing. In this paper, we propose a novel $ρ$-Vision framework to perform high-level semantic understanding and low-level compression using RAW images without the ISP subsystem used for decades. Considering the scarcity of available RAW image datasets, we first develop an unpaired CycleR2R network based on unsupervised CycleGAN to train modular unrolled ISP and inverse ISP (invISP) models using unpaired RAW and RGB images. We can then flexibly generate simulated RAW images (simRAW) using any existing RGB image dataset and finetune different models originally trained for the RGB domain to process real-world camera RAW images. We demonstrate object detection and image compression capabilities in RAW-domain using RAW-domain YOLOv3 and RAW image compressor (RIC) on snapshots from various cameras. Quantitative results reveal that RAW-domain task inference provides better detection accuracy and compression compared to RGB-domain processing. Furthermore, the proposed $ρ$-Vision generalizes across various camera sensors and different task-specific models. Additional advantages of the proposed $ρ$-Vision that eliminates the ISP are the potential reductions in computations and processing times.

Submitted 25 January, 2024; v1 submitted 15 December, 2022; originally announced December 2022.

Comments: Accepted by T-PAMI 2024. Homepage: https://njuvision.github.io/rho-vision

arXiv:2211.13092 [pdf, other]

doi 10.1109/TVT.2022.3213179

Efficient Rigid Body Localization based on Euclidean Distance Matrix Completion for AGV Positioning under Harsh Environment

Authors: Xinyuan An, Xiaowei Cui, Sihao Zhao, Gang Liu, Mingquan Lu

Abstract: In real-world applications for automatic guided vehicle (AGV) navigation, the positioning system based on the time-of-flight (TOF) measurements between anchors and tags is confronted with the problem of insufficient measurements caused by blockages to radio signals or lasers, etc. Mounting multiple tags at different positions of the AGV to collect more TOFs is a feasible solution to tackle this difficulty. Vehicle localization by exploiting the measurements between multiple tags and anchors is a rigid body localization (RBL) problem, which estimates both the position and attitude of the vehicle. However, the state-of-the-art solutions to the RBL problem do not deal with missing measurements, and thus will result in degraded localization availability and accuracy in harsh environments. In this paper, different from these existing solutions for RBL, we model this problem as a sensor network localization problem with missing TOFs. To solve this problem, we propose a new efficient RBL solution based on Euclidean distance matrix (EDM) completion, abbreviated as ERBL-EDMC. Firstly, we develop a method to determine the upper and lower bounds of the missing measurements to complete the EDM reliably, using the known relative positions between tags and the statistics of the TOF measurements. Then, based on the completed EDM, the global tag positions are obtained from a coarse estimation followed by a refinement step assisted with inter-tag distances. Finally, the optimal vehicle position and attitude are obtained iteratively based on the estimated tag positions from the previous step. Theoretical analysis and simulation results show that the proposed ERBL-EDMC method effectively solves the RBL problem with incomplete measurements. It obtains the optimal positioning results while maintaining low computational complexity compared with the existing RBL methods based on semi-definite relaxation.
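As background for the Euclidean distance matrix (EDM) named in this abstract, the sketch below builds the matrix of squared pairwise distances for a set of known points using the standard Gram-matrix identity D_ij = ||x_i||^2 + ||x_j||^2 - 2 x_i·x_j. It only illustrates the object being completed; the paper's bounding, completion, and refinement steps are not reproduced here, and the function name `squared_edm` is hypothetical.

```python
import numpy as np

def squared_edm(points):
    """Squared Euclidean distance matrix for points given as an (n, d) array."""
    x = np.asarray(points, dtype=float)
    gram = x @ x.T                 # pairwise inner products
    sq_norms = np.diag(gram)       # squared norm of each point
    d = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    # Clip tiny negative values caused by floating-point round-off.
    return np.maximum(d, 0.0)
```

In a positioning setting, the rows and columns would correspond to anchors and tags, and entries for blocked anchor-tag links would be unknown; filling in those entries is the completion problem the abstract refers to.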

Submitted 23 November, 2022; originally announced November 2022.

arXiv:2211.12621 [pdf]

doi 10.1007/s10291-016-0560-y

A priori knowledge-free fast positioning approach for BeiDou receivers

Authors: Sihao Zhao, Xiaowei Cui, Mingquan Lu

Abstract: A Global Navigation Satellite System (GNSS) receiver usually needs a sufficient number of full pseudorange measurements to obtain a position solution. However, it is time-consuming to acquire full pseudorange information from only the satellite broadcast signals due to the navigation data features of GNSS. In order to realize fast positioning during a cold or warm start in a GNSS receiver, the existing approaches require an initial estimation of position and time or require a number of computational steps to recover the full pseudorange information from fractional pseudoranges and then compute the position solution. The BeiDou Navigation Satellite System (BDS) has a unique constellation distribution and a fast navigation data rate for geostationary earth orbit (GEO) satellites. Taking advantage of these features, we propose a fast positioning technique for BDS receivers. It simultaneously processes the full and fractional pseudorange measurements from the BDS GEOs and non-GEOs, respectively, which is faster than processing all full measurements. This method resolves the position solution and recovers the full pseudoranges for non-GEOs simultaneously within 1 s theoretically and does not need an estimate of the initial position. Simulation and real data experiments confirm that the proposed technique completes fast positioning without a priori position and time estimation, and the positioning accuracy is identical with the conventional single-point positioning approach using full pseudorange measurements from all available satellites.

Submitted 22 November, 2022; originally announced November 2022.

arXiv:2211.02854 [pdf, other]

Rate-Distortion Optimized Post-Training Quantization for Learned Image Compression

Authors: Junqi Shi, Ming Lu, Zhan Ma

Abstract: Quantizing a floating-point neural network to its fixed-point representation is crucial for Learned Image Compression (LIC) because it improves decoding consistency for interoperability and reduces space-time complexity for implementation. Existing solutions often have to retrain the network for model quantization, which is time-consuming and impractical to some extent. This work suggests using Post-Training Quantization (PTQ) to process pretrained, off-the-shelf LIC models. We theoretically prove that minimizing quantization-induced mean square error (MSE) of model parameters (e.g., weight, bias, and activation) in PTQ is sub-optimal for compression tasks and thus develop a novel Rate-Distortion (R-D) Optimized PTQ (RDO-PTQ) to best retain the compression performance. Given a LIC model, RDO-PTQ layer-wisely determines the quantization parameters to transform the original floating-point parameters in 32-bit precision (FP32) to fixed-point ones at 8-bit precision (INT8), for which a tiny calibration image set is compressed in optimization to minimize R-D loss. Experiments reveal the outstanding efficiency of the proposed method on different LICs, showing the closest coding performance to their floating-point counterparts. Our method is a lightweight and plug-and-play approach without retraining model parameters but just adjusting quantization parameters, which is attractive to practitioners. Such an RDO-PTQ is a task-oriented PTQ scheme, which is then extended to quantize popular super-resolution and image classification models with negligible performance loss, further evidencing the generalization of our methodology. Related materials will be released at https://njuvision.github.io/RDO-PTQ.

Submitted 8 October, 2023; v1 submitted 5 November, 2022; originally announced November 2022.

arXiv:2210.01438 [pdf, other]

Complementary consistency semi-supervised learning for 3D left atrial image segmentation

Authors: Hejun Huang, Zuguo Chen, Chaoyang Chen, Ming Lu, Ying Zou

Abstract: A network based on complementary consistency training, called CC-Net, has been proposed for semi-supervised left atrium image segmentation. CC-Net efficiently utilizes unlabeled data from the perspective of complementary information to address the problem of limited ability of existing semi-supervised segmentation algorithms to extract information from unlabeled data. The complementary symmetric structure of CC-Net includes a main model and two auxiliary models. The complementary model inter-perturbations between the main and auxiliary models force consistency to form complementary consistency. The complementary information obtained by the two auxiliary models helps the main model to effectively focus on ambiguous areas, while enforcing consistency between the models is advantageous in obtaining decision boundaries with low uncertainty. CC-Net has been validated on two public datasets. In the case of specific proportions of labeled data, compared with current advanced algorithms, CC-Net has the best semi-supervised segmentation performance. Our code is publicly available at https://github.com/Cuthbert-Huang/CC-Net.

Submitted 4 April, 2023; v1 submitted 4 October, 2022; originally announced October 2022.
