-
Dynamical Multimodal Fusion with Mixture-of-Experts for Localizations
Authors:
Bohao Wang,
Zitao Shuai,
Fenghao Zhu,
Chongwen Huang,
Yongliang Shen,
Zhaoyang Zhang,
Qianqian Yang,
Sami Muhaidat,
Mérouane Debbah
Abstract:
Multimodal fingerprinting is a crucial technique for sub-meter 6G integrated sensing and communications (ISAC) localization, but two hurdles block deployment: (i) the contribution each modality makes to the target position varies with operating conditions such as carrier frequency, and (ii) spatial and fingerprint ambiguities markedly undermine localization accuracy, especially in non-line-of-sight (NLOS) scenarios. To solve these problems, we introduce SCADF-MoE, a spatial-context aware dynamic fusion network built on a soft mixture-of-experts backbone. SCADF-MoE first clusters neighboring points into short trajectories to inject explicit spatial context. Then, it adaptively fuses channel state information, angle of arrival profile, distance, and gain through its learnable MoE router, so that the most reliable cues dominate at each carrier band. The fused representation is fed to a modality-task MoE that simultaneously regresses the coordinates of every vertex in the trajectory and its centroid, thereby exploiting inter-point correlations. Finally, an auxiliary maximum-mean-discrepancy loss enforces expert diversity and mitigates gradient interference, stabilizing multi-task training. On three real urban layouts and three carrier bands (2.6, 6, and 28 GHz), the model delivers consistent sub-meter MSE and halves unseen-NLOS error versus the best prior work. To our knowledge, this is the first work that leverages large-scale multimodal MoE for frequency-robust ISAC localization.
Submitted 1 July, 2025;
originally announced July 2025.
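The router-weighted fusion described in the SCADF-MoE abstract above can be illustrated with a minimal sketch. It assumes each modality has already been embedded into a vector of the same length and that the router emits one logit per modality; the function names `softmax` and `fuse`, the modality keys, and the plain softmax gate are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, router_logits):
    """Soft fusion of per-modality embeddings (hypothetical helper).

    features:      dict, modality name -> list of floats (equal lengths)
    router_logits: dict, modality name -> float produced by a router
    Returns (weights, fused): the gate weight per modality and the
    weighted sum of the embeddings.
    """
    names = sorted(features)
    weights = softmax([router_logits[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for w, name in zip(weights, names):
        for i, x in enumerate(features[name]):
            fused[i] += w * x
    return dict(zip(names, weights)), fused


# Example call: a larger logit lets that modality dominate the fused vector.
example_weights, example_fused = fuse(
    {"csi": [1.0, 0.0], "aoa": [0.0, 1.0]},
    {"csi": 2.0, "aoa": 0.0},
)
```

In a trained model the logits would depend on the input (and hence on the carrier band), which is how the most reliable modality could receive the largest weight.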
-
Structured Pruning and Quantization for Learned Image Compression
Authors:
Md Adnan Faisal Hossain,
Fengqing Zhu
Abstract:
The high computational costs associated with large deep learning models significantly hinder their practical deployment. Model pruning has been widely explored in deep learning literature to reduce their computational burden, but its application has been largely limited to computer vision tasks such as image classification and object detection. In this work, we propose a structured pruning method targeted for Learned Image Compression (LIC) models that aims to reduce the computational costs associated with image compression while maintaining the rate-distortion performance. We employ a Neural Architecture Search (NAS) method based on the rate-distortion loss for computing the pruning ratio for each layer of the network. We compare our pruned model with the uncompressed LIC model with the same network architecture and show that it can achieve model size reduction without any BD-Rate performance drop. We further show that our pruning method can be integrated with model quantization to achieve further model compression while maintaining similar BD-Rate performance. We have made the source code available at gitlab.com/viper-purdue/lic-pruning.
Submitted 1 June, 2025;
originally announced June 2025.
-
Flexible Mixed Precision Quantization for Learned Image Compression
Authors:
Md Adnan Faisal Hossain,
Zhihao Duan,
Fengqing Zhu
Abstract:
Despite its improvements in coding performance compared to traditional codecs, Learned Image Compression (LIC) suffers from large computational costs for storage and deployment. Model quantization offers an effective solution to reduce the computational complexity of LIC models. However, most existing works perform fixed-precision quantization which suffers from sub-optimal utilization of resources due to the varying sensitivity to quantization of different layers of a neural network. In this paper, we propose a Flexible Mixed Precision Quantization (FMPQ) method that assigns different bit-widths to different layers of the quantized network using the fractional change in rate-distortion loss as the bit-assignment criterion. We also introduce an adaptive search algorithm which reduces the time-complexity of searching for the desired distribution of quantization bit-widths given a fixed model size. Evaluation of our method shows improved BD-Rate performance under similar model size constraints compared to other works on quantization of LIC models. We have made the source code available at gitlab.com/viper-purdue/fmpq.
Submitted 1 June, 2025;
originally announced June 2025.
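The bit-assignment idea in the FMPQ abstract above (per-layer bit-widths chosen by the fractional change in rate-distortion loss, under a fixed model size) can be sketched as a simple greedy loop. This is only an illustration: the greedy strategy, the function name `assign_bit_widths`, the candidate bit-widths, and the shape of the `layers` input are all assumptions and do not reproduce the paper's adaptive search algorithm.

```python
def assign_bit_widths(layers, budget_bits, candidate_bits=(8, 6, 4)):
    """Greedy mixed-precision assignment (hypothetical sketch).

    layers: dict, layer name -> {
        "params": number of weights in the layer,
        "sensitivity": {bits: fractional RD-loss change when the layer
                        is quantized to that bit-width},
    }
    budget_bits: maximum total model size in bits.
    candidate_bits: allowed bit-widths, from highest to lowest.
    """
    # Start every layer at the highest precision.
    bits = {name: candidate_bits[0] for name in layers}

    def model_size():
        return sum(layers[n]["params"] * bits[n] for n in layers)

    while model_size() > budget_bits:
        best = None  # (fractional loss change, layer name, next bit-width)
        for name, info in layers.items():
            idx = candidate_bits.index(bits[name])
            if idx + 1 >= len(candidate_bits):
                continue  # layer is already at the lowest precision
            nxt = candidate_bits[idx + 1]
            cost = info["sensitivity"][nxt]
            if best is None or cost < best[0]:
                best = (cost, name, nxt)
        if best is None:
            raise ValueError("size budget cannot be met with these bit-widths")
        # Lower the precision of the least sensitive layer first.
        bits[best[1]] = best[2]
    return bits
```

Layers whose rate-distortion loss barely changes are quantized more aggressively, while sensitive layers keep more bits, which is the intuition behind mixed precision.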
-
PS4PRO: Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization
Authors:
Yezhi Shen,
Qiuchen Zhai,
Fengqing Zhu
Abstract:
Neural rendering methods have gained significant attention for their ability to reconstruct 3D scenes from 2D images. The core idea is to take multiple views as input and optimize the reconstructed scene by minimizing the uncertainty in geometry and appearance across the views. However, the reconstruction quality is limited by the number of input views. This limitation is further pronounced in complex and dynamic scenes, where certain angles of objects are never seen. In this paper, we propose to use video frame interpolation as the data augmentation method for neural rendering. Furthermore, we design a lightweight yet high-quality video frame interpolation model, PS4PRO (Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization). PS4PRO is trained on diverse video datasets, implicitly modeling camera movement as well as real-world 3D geometry. Our model performs as an implicit world prior, enriching the photo supervision for 3D reconstruction. By leveraging the proposed method, we effectively augment existing datasets for neural rendering methods. Our experimental results indicate that our method improves the reconstruction performance on both static and dynamic scenes.
Submitted 28 May, 2025;
originally announced May 2025.
-
Accelerating Learned Image Compression Through Modeling Neural Training Dynamics
Authors:
Yichi Zhang,
Zhihao Duan,
Yuning Huang,
Fengqing Zhu
Abstract:
As learned image compression (LIC) methods become increasingly computationally demanding, enhancing their training efficiency is crucial. This paper takes a step forward in accelerating the training of LIC methods by modeling the neural training dynamics. We first propose a Sensitivity-aware True and Dummy Embedding Training mechanism (STDET) that clusters LIC model parameters into a few separate modes where parameters are expressed as affine transformations of reference parameters within the same mode. By further utilizing the stable intra-mode correlations throughout training and parameter sensitivities, we gradually embed non-reference parameters, reducing the number of trainable parameters. Additionally, we incorporate a Sampling-then-Moving Average (SMA) technique, interpolating sampled weights from stochastic gradient descent (SGD) training to obtain the moving average weights, ensuring smooth temporal behavior and minimizing training state variances. Overall, our method significantly reduces training space dimensions and the number of trainable parameters without sacrificing model performance, thus accelerating model convergence. We also provide a theoretical analysis on the noisy quadratic model, showing that the proposed method achieves a lower training variance than standard SGD. Our approach offers valuable insights for further developing efficient training methods for LICs.
Submitted 23 May, 2025;
originally announced May 2025.
-
Low-Rank Adaptation of Pre-trained Vision Backbones for Energy-Efficient Image Coding for Machine
Authors:
Yichi Zhang,
Zhihao Duan,
Yuning Huang,
Fengqing Zhu
Abstract:
Image Coding for Machines (ICM) focuses on optimizing image compression for AI-driven analysis rather than human perception. Existing ICM frameworks often rely on separate codecs for specific tasks, leading to significant storage requirements, training overhead, and computational complexity. To address these challenges, we propose an energy-efficient framework that leverages pre-trained vision backbones to extract robust and versatile latent representations suitable for multiple tasks. We introduce a task-specific low-rank adaptation mechanism, which refines the pre-trained features to be both compressible and tailored to downstream applications. This design minimizes trainable parameters and reduces energy costs for multi-task scenarios. By jointly optimizing task performance and entropy minimization, our method enables efficient adaptation to diverse tasks and datasets without full fine-tuning, achieving high coding efficiency. Extensive experiments demonstrate that our framework significantly outperforms traditional codecs and pre-processors, offering an energy-efficient and effective solution for ICM applications. The code and the supplementary materials will be available at: https://gitlab.com/viper-purdue/efficient-compression.
Submitted 28 May, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
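The low-rank adaptation mentioned in the ICM abstract above can be illustrated with the standard LoRA update, in which a frozen weight matrix W is adjusted by a scaled low-rank product B·A. The sketch below uses plain Python lists; the helper names, the `alpha / r` scaling, and the example layer sizes are assumptions for illustration and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A (standard LoRA form, assumed here).

    w: frozen pre-trained weight, shape d_out x d_in
    a: trainable matrix A, shape r x d_in
    b: trainable matrix B, shape d_out x r
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Number of trainable values in A and B for one adapted layer."""
    return r * (d_in + d_out)


# Example call with hypothetical sizes: a 768 x 768 layer adapted at rank 8.
full_params = 768 * 768
adapter_params = trainable_params(768, 768, 8)
```

Because only A and B are trained per task, the shared backbone can stay frozen, which is consistent with the abstract's goal of minimizing trainable parameters in multi-task settings.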
-
Aneumo: A Large-Scale Multimodal Aneurysm Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks
Authors:
Xigui Li,
Yuanye Zhou,
Feiyang Xiao,
Xin Guo,
Chen Jiang,
Tan Pan,
Xingmeng Zhang,
Cenyu Liu,
Zeyun Miao,
Jianchao Ge,
Xiansheng Wang,
Qimeng Wang,
Yichi Zhang,
Wenbo Zhang,
Fengping Zhu,
Limei Han,
Yuan Qi,
Chensen Lin,
Yuan Cheng
Abstract:
Intracranial aneurysms (IAs) are serious cerebrovascular lesions found in approximately 5% of the general population. Their rupture may lead to high mortality. Current methods for assessing IA risk focus on morphological and patient-specific factors, but the hemodynamic influences on IA development and rupture remain unclear. While accurate for hemodynamic studies, conventional computational fluid dynamics (CFD) methods are computationally intensive, hindering their deployment in large-scale or real-time clinical applications. To address this challenge, we curated a large-scale, high-fidelity aneurysm CFD dataset to facilitate the development of efficient machine learning algorithms for such applications. Based on 427 real aneurysm geometries, we synthesized 10,660 3D shapes via controlled deformation to simulate aneurysm evolution. The authenticity of these synthetic shapes was confirmed by neurosurgeons. CFD computations were performed on each shape under eight steady-state mass flow conditions, generating a total of 85,280 blood flow dynamics data samples covering key parameters. Furthermore, the dataset includes segmentation masks, which can support tasks that use images, point clouds or other multimodal data as input. Additionally, we introduced a benchmark for estimating flow parameters to assess current modeling methods. This dataset aims to advance aneurysm research and promote data-driven approaches in biofluids, biomedical engineering, and clinical risk assessment. The code and dataset are available at: https://github.com/Xigui-Li/Aneumo.
Submitted 19 May, 2025;
originally announced May 2025.
-
Robust Deep Learning-Based Physical Layer Communications: Strategies and Approaches
Authors:
Fenghao Zhu,
Xinquan Wang,
Chen Zhu,
Tierui Gong,
Zhaohui Yang,
Chongwen Huang,
Xiaoming Chen,
Zhaoyang Zhang,
Mérouane Debbah
Abstract:
Deep learning (DL) has emerged as a transformative technology with immense potential to reshape the sixth-generation (6G) wireless communication network. By utilizing advanced algorithms for feature extraction and pattern recognition, DL provides unprecedented capabilities in optimizing network efficiency and performance, particularly in physical layer communications. Although DL technologies present great potential, they also face significant challenges related to robustness, which are expected to intensify in the complex and demanding 6G environment. Specifically, current DL models typically exhibit substantial performance degradation in dynamic environments with time-varying channels, noise interference, and different scenarios, which affects their effectiveness in diverse real-world applications. This paper provides a comprehensive overview of strategies and approaches for robust DL-based methods in physical layer communications. First, we introduce the key challenges that current DL models face. Then, we delve into a detailed examination of DL approaches specifically tailored to enhance robustness in 6G, which are classified into data-driven and model-driven strategies. Finally, we verify the effectiveness of these methods through case studies and outline future research directions.
Submitted 2 May, 2025;
originally announced May 2025.
-
Flying through cluttered and dynamic environments with LiDAR
Authors:
Huajie Wu,
Wenyi Liu,
Yunfan Ren,
Zheng Liu,
Hairuo Wei,
Fangcheng Zhu,
Haotian Li,
Fu Zhang
Abstract:
Navigating unmanned aerial vehicles (UAVs) through cluttered and dynamic environments remains a significant challenge, particularly when dealing with fast-moving or suddenly appearing obstacles. This paper introduces a complete LiDAR-based system designed to enable UAVs to avoid various moving obstacles in complex environments. Benefiting from the high computational efficiency of perception and planning, the system can operate in real time using onboard computing resources with low latency. For dynamic environment perception, we have integrated our previous work, M-detector, into the system. M-detector ensures that moving objects of different sizes, colors, and types are reliably detected. For dynamic environment planning, we incorporate dynamic object predictions into the integrated planning and control (IPC) framework, namely DynIPC. This integration allows the UAV to utilize predictions about dynamic obstacles to effectively evade them. We validate our proposed system through both simulations and real-world experiments. In simulation tests, our system outperforms state-of-the-art baselines across several metrics, including success rate, time consumption, average flight time, and maximum velocity. In real-world trials, our system successfully navigates through forests, avoiding moving obstacles along its path.
Submitted 24 April, 2025;
originally announced April 2025.
-
Wireless Large AI Model: Shaping the AI-Native Future of 6G and Beyond
Authors:
Fenghao Zhu,
Xinquan Wang,
Xinyi Li,
Maojun Zhang,
Yixuan Chen,
Chongwen Huang,
Zhaohui Yang,
Xiaoming Chen,
Zhaoyang Zhang,
Richeng Jin,
Yongming Huang,
Wei Feng,
Tingting Yang,
Baoming Bai,
Feifei Gao,
Kun Yang,
Yuanwei Liu,
Sami Muhaidat,
Chau Yuen,
Kaibin Huang,
Kai-Kit Wong,
Dusit Niyato,
Mérouane Debbah
Abstract:
The emergence of sixth-generation and beyond communication systems is expected to fundamentally transform digital experiences through introducing unparalleled levels of intelligence, efficiency, and connectivity. A promising technology poised to enable this revolutionary vision is the wireless large AI model (WLAM), characterized by its exceptional capabilities in data processing, inference, and decision-making. In light of these remarkable capabilities, this paper provides a comprehensive survey of WLAM, elucidating its fundamental principles, diverse applications, critical challenges, and future research opportunities. We begin by introducing the background of WLAM and analyzing the key synergies with wireless networks, emphasizing the mutual benefits. Subsequently, we explore the foundational characteristics of WLAM, delving into their unique relevance in wireless environments. Then, the role of WLAM in optimizing wireless communication systems across various use cases and the reciprocal benefits are systematically investigated. Furthermore, we discuss the integration of WLAM with emerging technologies, highlighting their potential to enable transformative capabilities and breakthroughs in wireless communication. Finally, we thoroughly examine the high-level challenges hindering the practical implementation of WLAM and discuss pivotal future research directions.
Submitted 28 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
MAAM: A Lightweight Multi-Agent Aggregation Module for Efficient Image Classification Based on the MindSpore Framework
Authors:
Zhenkai Qin,
Feng Zhu,
Huan Zeng,
Xunyi Nong
Abstract:
The demand for lightweight models in image classification tasks under resource-constrained environments necessitates a balance between computational efficiency and robust feature representation. Traditional attention mechanisms, despite their strong feature modeling capability, often struggle with high computational complexity and structural rigidity, limiting their applicability in scenarios with limited computational resources (e.g., edge devices or real-time systems). To address this, we propose the Multi-Agent Aggregation Module (MAAM), a lightweight attention architecture integrated with the MindSpore framework. MAAM employs three parallel agent branches with independently parameterized operations to extract heterogeneous features, adaptively fused via learnable scalar weights, and refined through a convolutional compression layer. Leveraging MindSpore's dynamic computational graph and operator fusion, MAAM achieves 87.0% accuracy on the CIFAR-10 dataset, significantly outperforming conventional CNN (58.3%) and MLP (49.6%) models, while improving training efficiency by 30%. Ablation studies confirm the critical role of agent attention (accuracy drops to 32.0% if removed) and compression modules (25.5% if omitted), validating their necessity for maintaining discriminative feature learning. The framework's hardware acceleration capabilities and minimal memory footprint further demonstrate its practicality, offering a deployable solution for image classification in resource-constrained scenarios without compromising accuracy.
Submitted 18 April, 2025;
originally announced April 2025.
-
Achieving Tighter Finite-Time Rates for Heterogeneous Federated Stochastic Approximation under Markovian Sampling
Authors:
Feng Zhu,
Aritra Mitra,
Robert W. Heath
Abstract:
Motivated by collaborative reinforcement learning (RL) and optimization with time-correlated data, we study a generic federated stochastic approximation problem involving $M$ agents, where each agent is characterized by an agent-specific (potentially nonlinear) local operator. The goal is for the agents to communicate intermittently via a server to find the root of the average of the agents' local operators. The generality of our setting stems from allowing for (i) Markovian data at each agent and (ii) heterogeneity in the roots of the agents' local operators. The limited recent work that has accounted for both these features in a federated setting fails to guarantee convergence to the desired point or to show any benefit of collaboration; furthermore, these works rely on projection steps in their algorithms to guarantee bounded iterates. Our work overcomes each of these limitations. We develop a novel algorithm titled FedHSA, and prove that it guarantees convergence to the correct point, while enjoying an $M$-fold linear speedup in sample-complexity due to collaboration. To our knowledge, this is the first finite-time result of its kind, and establishing it (without relying on a projection step) entails a fairly intricate argument that accounts for the interplay between complex temporal correlations due to Markovian sampling, multiple local steps to save communication, and the drift-effects induced by heterogeneous local operators. Our results have implications for a broad class of heterogeneous federated RL problems (e.g., policy evaluation and control) with function approximation, where the agents' Markov decision processes can differ in their probability transition kernels and reward functions.
Submitted 15 April, 2025;
originally announced April 2025.
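The root-finding objective described in the FedHSA abstract above can be written out as a short formula. The notation is an assumption made for illustration (the abstract names only $M$): $G_i$ denotes agent $i$'s local operator, $\theta_i^\star$ its own root, and $\theta^\star$ the root of the average operator.

```latex
% Assumed notation: G_i is agent i's local operator, i = 1, ..., M.
\bar{G}(\theta) \;=\; \frac{1}{M} \sum_{i=1}^{M} G_i(\theta),
\qquad
\text{find } \theta^\star \text{ such that } \bar{G}(\theta^\star) = 0 .

% Heterogeneity: each agent may have its own root, which need not equal theta^star.
G_i(\theta_i^\star) = 0,
\qquad
\theta_i^\star \neq \theta^\star \ \text{in general}.
```

The gap between the local roots and $\theta^\star$ is the source of the drift effects that the abstract says must be controlled when agents take multiple local steps between communication rounds.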
-
TeleMoM: Consensus-Driven Telecom Intelligence via Mixture of Models
Authors:
Xinquan Wang,
Fenghao Zhu,
Chongwen Huang,
Zhaohui Yang,
Zhaoyang Zhang,
Sami Muhaidat,
Chau Yuen,
Mérouane Debbah
Abstract:
Large language models (LLMs) face significant challenges in specialized domains like telecommunication (Telecom) due to technical complexity, specialized terminology, and rapidly evolving knowledge. Traditional methods, such as scaling model parameters or retraining on domain-specific corpora, are computationally expensive and yield diminishing returns, while existing approaches like retrieval-augmented generation, mixture of experts, and fine-tuning struggle with accuracy, efficiency, and coordination. To address this issue, we propose Telecom mixture of models (TeleMoM), a consensus-driven ensemble framework that integrates multiple LLMs for enhanced decision-making in Telecom. TeleMoM employs a two-stage process: proponent models generate justified responses, and an adjudicator finalizes decisions, supported by a quality-checking mechanism. This approach leverages strengths of diverse models to improve accuracy, reduce biases, and handle domain-specific complexities effectively. Evaluation results demonstrate that TeleMoM achieves a 9.7% increase in answer accuracy, highlighting its effectiveness in Telecom applications.
Submitted 1 June, 2025; v1 submitted 3 April, 2025;
originally announced April 2025.
-
Liquid Neural Networks: Next-Generation AI for Telecom from First Principles
Authors:
Fenghao Zhu,
Xinquan Wang,
Chen Zhu,
Chongwen Huang
Abstract:
Artificial intelligence (AI) has emerged as a transformative technology with immense potential to reshape the next generation of wireless networks. By leveraging advanced algorithms and machine learning techniques, AI offers unprecedented capabilities in optimizing network performance, enhancing data processing efficiency, and enabling smarter decision-making processes. However, existing AI solutions face significant challenges in terms of robustness and interpretability. Specifically, current AI models exhibit substantial performance degradation in dynamic environments with varying data distributions, and the black-box nature of these algorithms raises concerns regarding safety, transparency, and fairness. This presents a major challenge in integrating AI into practical communication systems. Recently, a novel type of neural network, known as liquid neural networks (LNNs), has been designed from first principles to address these issues. In this paper, we explore the potential of LNNs in telecommunications. First, we illustrate the mechanisms of LNNs and highlight their unique advantages over traditional networks. Then, we unveil the opportunities that LNNs bring to future wireless networks. Furthermore, we discuss the challenges and design directions for the implementation of LNNs. Finally, we summarize the performance of LNNs in two case studies.
Submitted 3 April, 2025;
originally announced April 2025.
-
Diagnosis of Pulmonary Hypertension by Integrating Multimodal Data with a Hybrid Graph Convolutional and Transformer Network
Authors:
Fubao Zhu,
Yang Zhang,
Gengmin Liang,
Jiaofen Nan,
Yanting Li,
Chuang Han,
Danyang Sun,
Zhiguo Wang,
Chen Zhao,
Wenxuan Zhou,
Jian He,
Yi Xu,
Iokfai Cheang,
Xu Zhu,
Yanli Zhou,
Weihua Zhou
Abstract:
Early and accurate diagnosis of pulmonary hypertension (PH) is essential for optimal patient management. Differentiating between pre-capillary and post-capillary PH is critical for guiding treatment decisions. This study develops and validates a deep learning-based diagnostic model for PH, designed to classify patients as non-PH, pre-capillary PH, or post-capillary PH. This retrospective study analyzed data from 204 patients (112 with pre-capillary PH, 32 with post-capillary PH, and 60 non-PH controls) at the First Affiliated Hospital of Nanjing Medical University. Diagnoses were confirmed through right heart catheterization. We selected 6 samples from each category for the test set (18 samples, 10%), with the remaining 186 samples used for the training set. This process was repeated 35 times for testing. This paper proposes a deep learning model that combines graph convolutional networks (GCN), convolutional neural networks (CNN), and Transformers. The model was developed to process multimodal data, including short-axis (SAX) sequences, four-chamber (4CH) sequences, and clinical parameters. Our model achieved an area under the receiver operating characteristic curve (AUC) of 0.81 ± 0.06 (standard deviation) and an accuracy (ACC) of 0.73 ± 0.06 on the test set. The discriminative abilities were as follows: non-PH subjects (AUC = 0.74 ± 0.11), pre-capillary PH (AUC = 0.86 ± 0.06), and post-capillary PH (AUC = 0.83 ± 0.10). The model has the potential to support clinical decision-making by effectively integrating multimodal data to assist physicians in making accurate and timely diagnoses.
Submitted 27 March, 2025;
originally announced April 2025.
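The evaluation protocol in the pulmonary hypertension abstract above (6 test samples drawn from each of the three categories, repeated 35 times) can be sketched as a repeated class-balanced split. The function name `repeated_splits`, the seed, and the use of uniform random sampling are assumptions; the abstract does not say how the samples were selected in each repetition.

```python
import random


def repeated_splits(labels, per_class=6, repeats=35, seed=0):
    """Draw `repeats` train/test splits with `per_class` test items per class.

    labels: list of class labels, one per sample.
    Returns a list of (train_indices, test_indices) tuples.
    Uniform random sampling is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    splits = []
    for _ in range(repeats):
        test = []
        for label in sorted(by_class):
            test.extend(rng.sample(by_class[label], per_class))
        test_set = set(test)
        train = [i for i in range(len(labels)) if i not in test_set]
        splits.append((train, sorted(test)))
    return splits


# Class counts taken from the abstract: 112 pre-capillary, 32 post-capillary, 60 non-PH.
labels = ["pre"] * 112 + ["post"] * 32 + ["non"] * 60
splits = repeated_splits(labels)
```

Reporting the mean and standard deviation of AUC and accuracy across the 35 test sets would then give figures in the same form as those quoted in the abstract.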
-
DTU-Net: A Multi-Scale Dilated Transformer Network for Nonlinear Hyperspectral Unmixing
Authors:
ChenTong Wang,
Jincheng Gao,
Fei Zhu,
Abderrahim Halimi,
Cédric Richard
Abstract:
Transformers have shown significant success in hyperspectral unmixing (HU). However, challenges remain. While multi-scale and long-range spatial correlations are essential in unmixing tasks, current Transformer-based unmixing networks, built on Vision Transformer (ViT) or Swin-Transformer, struggle to capture them effectively. Additionally, current Transformer-based unmixing networks rely on the linear mixing model, which lacks the flexibility to accommodate scenarios where nonlinear effects are significant. To address these limitations, we propose a multi-scale Dilated Transformer-based unmixing network for nonlinear HU (DTU-Net). The encoder employs two branches. The first one performs multi-scale spatial feature extraction using Multi-Scale Dilated Attention (MSDA) in the Dilated Transformer, which varies dilation rates across attention heads to capture long-range and multi-scale spatial correlations. The second one performs spectral feature extraction utilizing 3D-CNNs with channel attention. The outputs from both branches are then fused to integrate multi-scale spatial and spectral information, which is subsequently transformed to estimate the abundances. The decoder is designed to accommodate both linear and nonlinear mixing scenarios. Its interpretability is enhanced by explicitly modeling the relationships between endmembers, abundances, and nonlinear coefficients in accordance with the polynomial post-nonlinear mixing model (PPNMM). Experiments on synthetic and real datasets validate the effectiveness of the proposed DTU-Net compared to PPNMM-derived methods and several advanced unmixing networks.
Submitted 5 March, 2025; v1 submitted 5 March, 2025;
originally announced March 2025.
-
RGBSQGrasp: Inferring Local Superquadric Primitives from Single RGB Image for Graspability-Aware Bin Picking
Authors:
Yifeng Xu,
Fan Zhu,
Ye Li,
Sebastian Ren,
Xiaonan Huang,
Yuhao Chen
Abstract:
Bin picking is a challenging robotic task due to occlusions and physical constraints that limit visual information for object recognition and grasping. Existing approaches often rely on known CAD models or prior object geometries, restricting generalization to novel or unknown objects. Other methods directly regress grasp poses from RGB-D data without object priors, but the inherent noise in depth sensing and the lack of object understanding make grasp synthesis and evaluation more difficult. Superquadrics (SQ) offer a compact, interpretable shape representation that captures the physical and graspability understanding of objects. However, recovering them from limited viewpoints is challenging, as existing methods rely on multiple perspectives for near-complete point cloud reconstruction, limiting their effectiveness in bin-picking. To address these challenges, we propose RGBSQGrasp, a grasping framework that leverages superquadric shape primitives and foundation metric depth estimation models to infer grasp poses from a monocular RGB camera, eliminating the need for depth sensors. Our framework integrates a universal, cross-platform dataset generation pipeline, a foundation model-based object point cloud estimation module, a global-local superquadric fitting network, and an SQ-guided grasp pose sampling module. By integrating these components, RGBSQGrasp reliably infers grasp poses through geometric reasoning, enhancing grasp stability and adaptability to unseen objects. Real-world robotic experiments demonstrate a 92% grasp success rate, highlighting the effectiveness of RGBSQGrasp in packed bin-picking environments.
Submitted 4 March, 2025;
originally announced March 2025.
-
Balanced Rate-Distortion Optimization in Learned Image Compression
Authors:
Yichi Zhang,
Zhihao Duan,
Yuning Huang,
Fengqing Zhu
Abstract:
Learned image compression (LIC) using deep learning architectures has seen significant advancements, yet standard rate-distortion (R-D) optimization often encounters imbalanced updates due to diverse gradients of the rate and distortion objectives. This imbalance can lead to suboptimal optimization, where one objective dominates, thereby reducing overall compression efficiency. To address this challenge, we reformulate R-D optimization as a multi-objective optimization (MOO) problem and introduce two balanced R-D optimization strategies that adaptively adjust gradient updates to achieve more equitable improvements in both rate and distortion. The first proposed strategy utilizes a coarse-to-fine gradient descent approach along standard R-D optimization trajectories, making it particularly suitable for training LIC models from scratch. The second proposed strategy analytically addresses the reformulated optimization as a quadratic programming problem with an equality constraint, which is ideal for fine-tuning existing models. Experimental results demonstrate that both proposed methods enhance the R-D performance of LIC models, achieving around a 2% BD-Rate reduction with acceptable additional training cost, leading to a more balanced and efficient optimization process. Code will be available at https://gitlab.com/viper-purdue/Balanced-RD.
Submitted 18 March, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
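The abstract above mentions treating balanced R-D optimization as a quadratic program with an equality constraint. As a rough illustration only, the sketch below solves a generic two-gradient minimum-norm problem: find weights that sum to one and minimize the norm of the combined update. This is a textbook multi-objective construction, not the authors' formulation; the function names and the toy gradient values are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_norm_weights(g_rate, g_dist, eps=1e-12):
    """Return (w_rate, w_dist) with w_rate + w_dist = 1 that minimize
    ||w_rate * g_rate + w_dist * g_dist||^2.

    Only the equality constraint is enforced, so a weight may become
    negative when the two gradients are nearly aligned (assumption of
    this sketch, not a claim about the paper's method).
    """
    diff = [r - d for r, d in zip(g_rate, g_dist)]
    denom = dot(diff, diff)
    if denom < eps:  # gradients coincide; any split gives the same update
        return 0.5, 0.5
    w_rate = -dot(g_dist, diff) / denom
    return w_rate, 1.0 - w_rate


# Toy gradients (hypothetical): the rate gradient is much larger.
g_rate = [4.0, 0.0]
g_dist = [0.0, 1.0]
w_rate, w_dist = min_norm_weights(g_rate, g_dist)
update = [w_rate * r + w_dist * d for r, d in zip(g_rate, g_dist)]
```

With these weights the combined update has the same inner product with both gradients, so neither objective dominates the step to first order.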
-
MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds
Authors:
Jinge Ma,
Xiaoyan Zhang,
Gautham Vinod,
Siddeshwar Raghavan,
Jiangpeng He,
Fengqing Zhu
Abstract:
Food portion estimation is crucial for monitoring health and tracking dietary intake. Image-based dietary assessment, which involves analyzing eating occasion images using computer vision techniques, is increasingly replacing traditional methods such as 24-hour recalls. However, accurately estimating the nutritional content from images remains challenging due to the loss of 3D information when projecting to the 2D image plane. Existing portion estimation methods are challenging to deploy in real-world scenarios due to their reliance on specific requirements, such as physical reference objects, high-quality depth information, or multi-view images and videos. In this paper, we introduce MFP3D, a new framework for accurate food portion estimation using only a single monocular image. Specifically, MFP3D consists of three key modules: (1) a 3D Reconstruction Module that generates a 3D point cloud representation of the food from the 2D image, (2) a Feature Extraction Module that extracts and concatenates features from both the 3D point cloud and the 2D RGB image, and (3) a Portion Regression Module that employs a deep regression model to estimate the food's volume and energy content based on the extracted features. MFP3D is evaluated on the MetaFood3D dataset, demonstrating a significant improvement in portion estimation accuracy over existing methods.
Submitted 14 November, 2024;
originally announced November 2024.
-
Mitigating Parameter Degeneracy using Joint Conditional Diffusion Model for WECC Composite Load Model in Power Systems
Authors:
Feiqin Zhu,
Dmitrii Torbunov,
Yihui Ren,
Zhongjing Jiang,
Tianqiao Zhao,
Amirthagunaraj Yogarathnam,
Meng Yue
Abstract:
Data-driven modeling for dynamic systems has gained widespread attention in recent years. Its inverse formulation, parameter estimation, aims to infer the inherent model parameters from observations. However, parameter degeneracy, where different combinations of parameters yield the same observable output, poses a critical barrier to accurately and uniquely identifying model parameters. In the context of the WECC composite load model (CLM) in power systems, utility practitioners have observed that CLM parameters carefully selected for one fault event may not perform satisfactorily in another fault. Here, we propose a joint conditional diffusion model-based inverse problem solver (JCDI) that incorporates a joint conditioning architecture with simultaneous inputs of multi-event observations to improve parameter generalizability. Simulation studies on the WECC CLM show that the proposed JCDI effectively reduces uncertainties of degenerate parameters, decreasing the parameter estimation error by 42.1% compared to a single-event learning scheme. This enables the model to achieve high accuracy in predicting power trajectories under different fault events, including electronic load tripping and motor stalling, outperforming standard deep reinforcement learning and supervised learning approaches. We anticipate this work will contribute to mitigating parameter degeneracy in system dynamics, providing a general parameter estimation framework across various scientific domains.
Submitted 15 November, 2024;
originally announced November 2024.
-
Mindalogue: LLM-Powered Nonlinear Interaction for Effective Learning and Task Exploration
Authors:
Rui Zhang,
Ziyao Zhang,
Fengliang Zhu,
Jiajie Zhou,
Anyi Rao
Abstract:
Current generative AI models like ChatGPT, Claude, and Gemini are widely used for knowledge dissemination, task decomposition, and creative thinking. However, their linear interaction methods often force users to repeatedly compare and copy contextual information when handling complex tasks, increasing cognitive load and operational costs. Moreover, the ambiguity in model responses requires users to refine and simplify the information further. To address these issues, we developed "Mindalogue", a system using a non-linear interaction model based on "nodes + canvas" to enhance user efficiency and freedom while generating structured responses. A formative study with 11 users informed the design of Mindalogue, which was then evaluated through a study with 16 participants. The results showed that Mindalogue significantly reduced task steps and improved users' comprehension of complex information. This study highlights the potential of non-linear interaction in improving AI tool efficiency and user experience in the HCI field.
Submitted 15 October, 2024; v1 submitted 14 October, 2024;
originally announced October 2024.
-
High-Efficiency Neural Video Compression via Hierarchical Predictive Learning
Authors:
Ming Lu,
Zhihao Duan,
Wuyang Cong,
Dandan Ding,
Fengqing Zhu,
Zhan Ma
Abstract:
The enhanced Deep Hierarchical Video Compression (DHVC 2.0) has been introduced. This single-model neural video codec operates across a broad range of bitrates, delivering not only superior compression performance to representative methods but also impressive complexity efficiency, enabling real-time processing with a significantly smaller memory footprint on standard GPUs. These remarkable advancements stem from the use of hierarchical predictive coding. Each video frame is uniformly transformed into multiscale representations through hierarchical variational autoencoders. For a specific scale's feature representation of a frame, its corresponding latent residual variables are generated by referencing lower-scale spatial features from the same frame and then conditionally entropy-encoded using a probabilistic model whose parameters are predicted using the same-scale temporal reference from previous frames and the lower-scale spatial reference of the current frame. This feature-space processing operates from the lowest to the highest scale of each frame, completely eliminating the need for the complexity-intensive motion estimation and compensation techniques that have been standard in video codecs for decades. The hierarchical approach facilitates parallel processing, accelerating both encoding and decoding, and supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss. Source code will be made available.
Submitted 3 October, 2024;
originally announced October 2024.
-
Towards Fast Rates for Federated and Multi-Task Reinforcement Learning
Authors:
Feng Zhu,
Robert W. Heath Jr.,
Aritra Mitra
Abstract:
We consider a setting involving $N$ agents, where each agent interacts with an environment modeled as a Markov Decision Process (MDP). The agents' MDPs differ in their reward functions, capturing heterogeneous objectives/tasks. The collective goal of the agents is to communicate intermittently via a central server to find a policy that maximizes the average of long-term cumulative rewards across environments. The limited existing work on this topic provides only asymptotic rates, generates biased policies, or fails to establish any benefits of collaboration. In response, we propose Fast-FedPG, a novel federated policy gradient algorithm with a carefully designed bias-correction mechanism. Under a gradient-domination condition, we prove that our algorithm guarantees (i) fast linear convergence with exact gradients, and (ii) sub-linear rates that enjoy a linear speedup w.r.t. the number of agents with noisy, truncated policy gradients. Notably, in each case, the convergence is to a globally optimal policy with no heterogeneity-induced bias. In the absence of gradient-domination, we establish convergence to a first-order stationary point at a rate that continues to benefit from collaboration.
Submitted 8 September, 2024;
originally announced September 2024.
-
High-Resolution Spatial Transcriptomics from Histology Images using HisToSGE
Authors:
Zhiceng Shi,
Shuailin Xue,
Fangfang Zhu,
Wenwen Min
Abstract:
Spatial transcriptomics (ST) is a groundbreaking genomic technology that enables spatial localization analysis of gene expression within tissue sections. However, it is significantly limited by high costs and sparse spatial resolution. An alternative, more cost-effective strategy is to use deep learning methods to predict high-density gene expression profiles from histological images. However, existing methods struggle to capture rich image features effectively or rely on low-dimensional positional coordinates, making it difficult to accurately predict high-resolution gene expression profiles. To address these limitations, we developed HisToSGE, a method that employs a Pathology Image Large Model (PILM) to extract rich image features from histological images and utilizes a feature learning module to robustly generate high-resolution gene expression profiles. We evaluated HisToSGE on four ST datasets, comparing its performance with five state-of-the-art baseline methods. The results demonstrate that HisToSGE excels in generating high-resolution gene expression profiles and performing downstream tasks such as spatial domain identification. All code and public datasets used in this paper are available at https://github.com/wenwenmin/HisToSGE and https://zenodo.org/records/12792163.
Submitted 29 July, 2024;
originally announced July 2024.
-
On Efficient Neural Network Architectures for Image Compression
Authors:
Yichi Zhang,
Zhihao Duan,
Fengqing Zhu
Abstract:
Recent advances in learning-based image compression typically come at the cost of high complexity. Designing computationally efficient architectures remains an open challenge. In this paper, we empirically investigate the impact of different network designs in terms of rate-distortion performance and computational complexity. Our experiments involve testing various transforms, including convolutional neural networks and transformers, as well as various context models, including hierarchical, channel-wise, and space-channel context models. Based on the results, we present a series of efficient models, the final model of which has comparable performance to recent best-performing methods but with significantly lower complexity. Extensive experiments provide insights into the design of architectures for learned image compression and potential directions for future research. The code is available at https://gitlab.com/viper-purdue/efficient-compression.
Submitted 14 June, 2024;
originally announced June 2024.
-
Robust Beamforming with Gradient-based Liquid Neural Network
Authors:
Xinquan Wang,
Fenghao Zhu,
Chongwen Huang,
Ahmed Alhammadi,
Faouzi Bader,
Zhaoyang Zhang,
Chau Yuen,
Merouane Debbah
Abstract:
Millimeter-wave (mmWave) multiple-input multiple-output (MIMO) communication with advanced beamforming technologies is a key enabler to meet the growing demands of future mobile communication. However, the dynamic nature of cellular channels in large-scale urban mmWave MIMO communication scenarios brings substantial challenges, particularly in terms of complexity and robustness. To address these issues, we propose a robust gradient-based liquid neural network (GLNN) framework that utilizes ordinary differential equation-based liquid neurons to solve the beamforming problem. Specifically, our proposed GLNN framework takes gradients of the optimization objective function as inputs to extract the high-order channel feature information, and then introduces a residual connection to mitigate the training burden. Furthermore, we use the manifold learning technique to compress the search space of the beamforming problem. These designs enable the GLNN to effectively maintain low complexity while ensuring strong robustness to noisy and highly dynamic channels. Extensive simulation results demonstrate that the GLNN can achieve 4.15% higher spectral efficiency than typical iterative algorithms, and reduce the time consumption to only 1.61% of that of conventional methods.
Submitted 29 July, 2024; v1 submitted 12 May, 2024;
originally announced May 2024.
-
Beamforming Inferring by Conditional WGAN-GP for Holographic Antenna Arrays
Authors:
Fenghao Zhu,
Xinquan Wang,
Chongwen Huang,
Ahmed Alhammadi,
Hui Chen,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
The beamforming technology with large holographic antenna arrays is one of the key enablers for the next generation of wireless systems, which can significantly improve the spectral efficiency. However, the deployment of large antenna arrays implies high algorithm complexity and resource overhead at both receiver and transmitter ends. To address this issue, advanced technologies such as artificial intelligence have been developed to reduce beamforming overhead. Intuitively, if we can implement near-optimal beamforming using only a tiny subset of the channel information, the overhead for channel estimation and beamforming would be reduced significantly compared with traditional beamforming methods that usually need full channel information and the inversion of large-dimensional matrices. In light of this idea, we propose a novel scheme that utilizes a Wasserstein generative adversarial network with gradient penalty to infer the full beamforming matrices based on very little channel information. Simulation results confirm that it can achieve performance comparable to the weighted minimum mean-square error algorithm, while reducing the overhead by over 50%.
Submitted 15 May, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Robust Continuous-Time Beam Tracking with Liquid Neural Network
Authors:
Fenghao Zhu,
Xinquan Wang,
Chongwen Huang,
Richeng Jin,
Qianqian Yang,
Ahmed Alhammadi,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
Millimeter-wave (mmWave) technology is increasingly recognized as a pivotal technology of the sixth-generation communication networks due to the large amounts of available spectrum at high frequencies. However, the huge overhead associated with beam training imposes a significant challenge in mmWave communications, particularly in urban environments with high background noise. To reduce this high overhead, we propose a novel solution for robust continuous-time beam tracking with a liquid neural network, which dynamically adjusts the narrow mmWave beams to ensure real-time beam alignment with mobile users. Through extensive simulations, we validate the effectiveness of our proposed method and demonstrate its superiority over existing state-of-the-art deep-learning-based approaches. Specifically, our scheme achieves up to 46.9% higher normalized spectral efficiency than the baselines when the user is moving at 5 m/s, demonstrating the potential of liquid neural networks to enhance mmWave mobile communication performance.
Submitted 26 August, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Jitter Characterization of the HyTI Satellite
Authors:
Chase Urasaki,
Frances Zhu,
Michael Bottom,
Miguel Nunes,
Aidan Walk
Abstract:
The Hyperspectral Thermal Imager (HyTI) is a technology demonstration mission that will obtain high spatial, spectral, and temporal resolution long-wave infrared images of Earth's surface from a 6U cubesat. HyTI science requires that the pointing accuracy of the optical axis shall not exceed 2.89 arcsec over the 0.5 ms integration time due to microvibration effects (known as jitter). Two sources of vibration are a cryocooler that is added to maintain the detector at 68 K and three orthogonally placed reaction wheels that are a part of the attitude control system. Both of these parts will introduce vibrations that are propagated through to the satellite structure while imaging. Typical methods of characterizing and measuring jitter involve complex finite element methods and specialized equipment and setups. In this paper, we describe a novel method of characterizing jitter for small satellite systems that is low-cost and minimally modifies the subject's mass distribution. The metrology instrument comprises a laser source, a small mirror mounted via a 3D printed clamp to a jig, and a lateral effect position-sensing detector. The position-sensing detector samples at 1000 Hz and can measure displacements as little as 0.15 arcsec at distances of one meter. This paper provides an experimental procedure that incrementally analyzes vibratory sources to establish causal relationships between sources and the vibratory modes they create. We demonstrate the capabilities of this metrology system and testing procedure on HyTI in the Hawaii Space Flight Lab's clean room. Results include power spectral density plots that show fundamental and higher-order vibratory modal frequencies. Results from metrology show that jitter from reaction wheels meets HyTI system requirements within 3σ.
Submitted 23 April, 2024;
originally announced April 2024.
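To give a feel for the angular figures quoted in the HyTI abstract above, the short sketch below converts an angle in arcseconds into the lateral displacement it subtends at a given distance. It is plain geometry, not the authors' analysis: the function name is hypothetical, and the doubling of beam deflection on reflection from a tilted mirror is deliberately ignored.

```python
import math

# Number of arcseconds in one radian.
ARCSEC_PER_RADIAN = 180.0 * 3600.0 / math.pi


def arcsec_to_displacement_m(angle_arcsec, distance_m):
    """Lateral displacement (meters) subtended by an angle at a distance.

    Assumes a simple line-of-sight geometry; mirror reflection effects
    are not modeled.
    """
    angle_rad = angle_arcsec / ARCSEC_PER_RADIAN
    return distance_m * math.tan(angle_rad)


# Figures quoted in the abstract, evaluated at a one-meter distance.
resolution_um = arcsec_to_displacement_m(0.15, 1.0) * 1e6
requirement_um = arcsec_to_displacement_m(2.89, 1.0) * 1e6
```

Under these assumptions, 0.15 arcsec at one meter corresponds to well under a micrometer of lateral motion, and the 2.89 arcsec requirement to roughly 14 micrometers.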
-
Food Portion Estimation via 3D Object Scaling
Authors:
Gautham Vinod,
Jiangpeng He,
Zeman Shao,
Fengqing Zhu
Abstract:
Image-based methods to analyze food images have alleviated the user burden and biases associated with traditional methods. However, accurate portion estimation remains a major challenge due to the loss of 3D information in the 2D representation of foods captured by smartphone cameras or wearable devices. In this paper, we propose a new framework to estimate both food volume and energy from 2D images by leveraging the power of 3D food models and physical reference in the eating scene. Our method estimates the pose of the camera and the food object in the input image and recreates the eating occasion by rendering an image of a 3D model of the food with the estimated poses. We also introduce a new dataset, SimpleFood45, which contains 2D images of 45 food items and associated annotations including food volume, weight, and energy. Our method achieves an average error of 31.10 kCal (17.67%) on this dataset, outperforming existing portion estimation methods. The dataset can be accessed at: https://lorenz.ecn.purdue.edu/~gvinod/simplefood45/ and the code can be accessed at: https://gitlab.com/viper-purdue/monocular-food-volume-3d
Submitted 10 October, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Learning to Classify New Foods Incrementally Via Compressed Exemplars
Authors:
Justin Yang,
Zhihao Duan,
Jiangpeng He,
Fengqing Zhu
Abstract:
Food image classification systems play a crucial role in health monitoring and diet tracking through image-based dietary assessment techniques. However, existing food recognition systems rely on static datasets characterized by a pre-defined fixed number of food classes. This contrasts drastically with the reality of food consumption, which features constantly changing data. Therefore, food image classification systems should adapt to and manage data that continuously evolves. This is where continual learning plays an important role. A challenge in continual learning is catastrophic forgetting, where ML models tend to discard old knowledge upon learning new information. While memory-replay algorithms have shown promise in mitigating this problem by storing old data as exemplars, they are hampered by the limited capacity of memory buffers, leading to an imbalance between new and previously learned data. To address this, our work explores the use of neural image compression to extend buffer size and enhance data diversity. We introduced the concept of continuously learning a neural compression model to adaptively improve the quality of compressed data and optimize the bitrates per pixel (bpp) to store more exemplars. Our extensive experiments, including evaluations on food-specific datasets including Food-101 and VFN-74, as well as the general dataset ImageNet-100, demonstrate improvements in classification accuracy. This progress is pivotal in advancing more realistic food recognition systems that are capable of adapting to continually evolving data. Moreover, the principles and methodologies we've developed hold promise for broader applications, extending their benefits to other domains of continual machine learning systems.
Submitted 11 April, 2024;
originally announced April 2024.
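The compressed-exemplar abstract above ties bits per pixel (bpp) to how many exemplars a fixed memory buffer can hold. The arithmetic can be sketched as follows; the buffer size, image resolution, and bpp values are hypothetical and are not taken from the paper.

```python
def exemplars_that_fit(buffer_bytes, height, width, bpp):
    """Number of whole images that fit in a buffer at a given bitrate.

    bpp is the average number of stored bits per pixel; container and
    header overhead are ignored in this sketch.
    """
    bits_per_image = bpp * height * width
    return int(buffer_bytes * 8 // bits_per_image)


# Hypothetical setup: a 1 MiB buffer and 224x224 images.
BUFFER_BYTES = 1024 * 1024
raw_count = exemplars_that_fit(BUFFER_BYTES, 224, 224, 24)          # uncompressed RGB
compressed_count = exemplars_that_fit(BUFFER_BYTES, 224, 224, 0.5)  # assumed codec rate
```

In this toy setting the same buffer holds 6 raw images but 334 compressed ones, which is the kind of capacity gain the abstract alludes to.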
-
Flexible Variable-Rate Image Feature Compression for Edge-Cloud Systems
Authors:
Md Adnan Faisal Hossain,
Zhihao Duan,
Yuning Huang,
Fengqing Zhu
Abstract:
Feature compression is a promising direction for coding for machines. Existing methods have made substantial progress, but they require designing and training separate neural network models to meet different specifications of compression rate, performance accuracy and computational complexity. In this paper, a flexible variable-rate feature compression method is presented that can operate on a range of rates by introducing a rate control parameter as an input to the neural network model. By compressing different intermediate features of a pre-trained vision task model, the proposed method can scale the encoding complexity without changing the overall size of the model. The proposed method is more flexible than existing baselines, at the same time outperforming them in terms of the three-way trade-off between feature compression rate, vision task accuracy, and encoding complexity. We have made the source code available at https://github.com/adnan-hossain/var_feat_comp.git.
Submitted 30 March, 2024;
originally announced April 2024.
-
Theoretical Bound-Guided Hierarchical VAE for Neural Image Codecs
Authors:
Yichi Zhang,
Zhihao Duan,
Yuning Huang,
Fengqing Zhu
Abstract:
Recent studies reveal a significant theoretical link between variational autoencoders (VAEs) and rate-distortion theory, notably in utilizing VAEs to estimate the theoretical upper bound of the information rate-distortion function of images. Such estimated theoretical bounds substantially exceed the performance of existing neural image codecs (NICs). To narrow this gap, we propose a theoretical bound-guided hierarchical VAE (BG-VAE) for NIC. The proposed BG-VAE leverages the theoretical bound to guide the NIC model towards enhanced performance. We implement the BG-VAE using Hierarchical VAEs and demonstrate its effectiveness through extensive experiments. Along with advanced neural network blocks, we provide a versatile, variable-rate NIC that outperforms existing methods when considering both rate-distortion performance and computational complexity. The code is available at BG-VAE.
Submitted 27 March, 2024;
originally announced March 2024.
-
Towards Backward-Compatible Continual Learning of Image Compression
Authors:
Zhihao Duan,
Ming Lu,
Justin Yang,
Jiangpeng He,
Zhan Ma,
Fengqing Zhu
Abstract:
This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions, such as end-to-end fine-tuning, do not preserve the desired backward compatibility. To tackle this, we propose a knowledge replay training strategy that effectively addresses this issue. We also design a new model architecture that enables more effective continual learning than existing baselines. Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning. The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility. Our code is available at https://gitlab.com/viper-purdue/continual-compression
Submitted 29 February, 2024;
originally announced February 2024.
-
Robust Beamforming for RIS-aided Communications: Gradient-based Manifold Meta Learning
Authors:
Fenghao Zhu,
Xinquan Wang,
Chongwen Huang,
Zhaohui Yang,
Xiaoming Chen,
Ahmed Alhammadi,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
Reconfigurable intelligent surface (RIS) has become a promising technology to realize the programmable wireless environment via steering the incident signal in fully customizable ways. However, a major challenge in RIS-aided communication systems is the simultaneous design of the precoding matrix at the base station (BS) and the phase shifting matrix of the RIS elements. This is mainly attributed to the highly non-convex optimization space of variables at both the BS and the RIS, and the diversity of communication environments. Generally, traditional optimization methods for this problem suffer from high complexity, while existing deep learning based methods lack robustness in various scenarios. To address these issues, we introduce a gradient-based manifold meta learning method (GMML), which works without pre-training and has strong robustness for RIS-aided communications. Specifically, the proposed method fuses meta learning and manifold learning to improve the overall spectral efficiency and reduce the overhead of high-dimensional signal processing. Unlike traditional deep learning based methods which directly take channel state information as input, GMML feeds the gradients of the precoding matrix and phase shifting matrix into neural networks. Coherently, we design a differential regulator to constrain the phase shifting matrix of the RIS. Numerical results show that the proposed GMML can improve the spectral efficiency by up to 7.31% and speed up convergence by 23 times compared to traditional approaches. Moreover, they also demonstrate remarkable robustness and adaptability in dynamic settings.
Submitted 24 July, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
3D Lymphoma Segmentation on PET/CT Images via Multi-Scale Information Fusion with Cross-Attention
Authors:
Huan Huang,
Liheng Qiu,
Shenmiao Yang,
Longxi Li,
Jiaofen Nan,
Yanting Li,
Chuang Han,
Fubao Zhu,
Chen Zhao,
Weihua Zhou
Abstract:
Background: Accurate segmentation of diffuse large B-cell lymphoma (DLBCL) lesions is challenging due to their complex patterns in medical imaging.
Objective: This study aims to develop a precise segmentation method for DLBCL using 18F-Fluorodeoxyglucose (FDG) positron emission tomography (PET) and computed tomography (CT) images.
Methods: We propose a 3D dual-branch encoder segmentation method using shifted window transformers and a Multi-Scale Information Fusion (MSIF) module. To enhance feature integration, the MSIF module performs multi-scale feature fusion using cross-attention mechanisms with a shifted window framework. A gated neural network within the MSIF module dynamically balances the contributions from each modality. The model was optimized using the Dice Similarity Coefficient (DSC) loss function. Additionally, we computed the total metabolic tumor volume (TMTV) and performed statistical analyses.
Results: The model was trained and validated on a dataset of 165 DLBCL patients using 5-fold cross-validation, achieving a DSC of 0.7512. Statistical analysis showed a significant improvement over comparative methods (p < 0.05). Additionally, a Pearson correlation coefficient of 0.91 and an R^2 of 0.89 were observed when comparing manual annotations to segmentation results for TMTV measurement.
Conclusion: This study presents an effective automatic segmentation method for DLBCL that leverages the complementary strengths of PET and CT imaging. Our method has the potential to improve diagnostic interpretations and assist in treatment planning for DLBCL patients.
Submitted 9 September, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
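The Dice Similarity Coefficient (DSC) reported in the entry above is commonly defined on binary masks as DSC = 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that standard definition on flattened masks, not the authors' implementation; the function name `dice_coefficient`, the toy masks, and the empty-mask convention are illustrative assumptions.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example (hypothetical data): 2 overlapping voxels, 3 foreground voxels each.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
```

A DSC-based training loss is often taken as 1 − DSC computed on soft (probabilistic) predictions; whether the paper uses exactly that form is not stated in the abstract.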
-
Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding
Authors:
Yichi Zhang,
Zhihao Duan,
Ming Lu,
Dandan Ding,
Fengqing Zhu,
Zhan Ma
Abstract:
While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions to pass through the local attention units for inter-cluster embedding. Additionally, we introduce the Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage. Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized using MSE, it outperforms VVC by about 10% BD-Rate in three widely-used benchmark datasets; when optimized using MS-SSIM, it saves more than 50% BD-Rate over VVC. Our CLIC offers a new way to generate compact representations for image compression, which also provides a novel direction along the line of LIC development.
Submitted 21 January, 2024;
originally announced January 2024.
-
Deep Hierarchical Video Compression
Authors:
Ming Lu,
Zhihao Duan,
Fengqing Zhu,
Zhan Ma
Abstract:
Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with a much smaller memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding that is favored in networked video applications with packet loss.
Submitted 12 December, 2023;
originally announced December 2023.
-
Energy-efficient Beamforming for RISs-aided Communications: Gradient Based Meta Learning
Authors:
Xinquan Wang,
Fenghao Zhu,
Qianyun Zhou,
Qihao Yu,
Chongwen Huang,
Ahmed Alhammadi,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
Reconfigurable intelligent surfaces (RISs) have become a promising technology to meet the requirements of energy efficiency and scalability in future sixth-generation (6G) communications. However, a significant challenge in RISs-aided communications is the joint optimization of active and passive beamforming at base stations (BSs) and RISs respectively. Specifically, the main difficulty is attributed to the highly non-convex optimization space of beamforming matrices at both BSs and RISs, as well as the diversity and mobility of communication scenarios. To address this, we present a green gradient based meta learning beamforming (GMLB) approach. Unlike traditional deep learning based methods which take channel information directly as input, GMLB feeds the gradient of sum rate into neural networks. Coherently, we design a differential regulator to address the phase shift optimization of RISs. Moreover, we use meta learning to iteratively optimize the beamforming matrices of BSs and RISs. These techniques enable the proposed method to work well without requiring energy-consuming pre-training. Simulations show that GMLB could achieve a higher sum rate than typical alternating optimization algorithms while consuming two orders of magnitude less energy.
Submitted 16 February, 2024; v1 submitted 12 November, 2023;
originally announced November 2023.
-
A Robust Deep Learning Method with Uncertainty Estimation for the Pathological Classification of Renal Cell Carcinoma based on CT Images
Authors:
Ni Yao,
Hang Hu,
Kaicong Chen,
Chen Zhao,
Yuan Guo,
Boya Li,
Jiaofen Nan,
Yanting Li,
Chuang Han,
Fubao Zhu,
Weihua Zhou,
Li Tian
Abstract:
Objectives: To develop and validate a deep learning-based diagnostic model incorporating uncertainty estimation so as to assist radiologists in the preoperative differentiation of the pathological subtypes of renal cell carcinoma (RCC) based on CT images. Methods: Data from 668 consecutive patients with pathologically proven RCC were retrospectively collected from Center 1. By using five-fold cross-validation, a deep learning model incorporating uncertainty estimation was developed to classify RCC subtypes into clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chRCC). An external validation set of 78 patients from Center 2 further evaluated the model's performance. Results: In the five-fold cross-validation, the model's area under the receiver operating characteristic curve (AUC) for the classification of ccRCC, pRCC, and chRCC was 0.868 (95% CI: 0.826-0.923), 0.846 (95% CI: 0.812-0.886), and 0.839 (95% CI: 0.802-0.88), respectively. In the external validation set, the AUCs were 0.856 (95% CI: 0.838-0.882), 0.787 (95% CI: 0.757-0.818), and 0.793 (95% CI: 0.758-0.831) for ccRCC, pRCC, and chRCC, respectively. Conclusions: The developed deep learning model demonstrated robust performance in predicting the pathological subtypes of RCC, while the incorporated uncertainty emphasized the importance of understanding model confidence, which is crucial for assisting clinical decision-making for patients with renal tumors. Clinical relevance statement: Our deep learning approach, integrated with uncertainty estimation, offers clinicians a dual advantage: accurate RCC subtype predictions complemented by diagnostic confidence references, promoting informed decision-making for patients with RCC.
Submitted 12 November, 2023; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
Authors:
Jinzuomu Zhong,
Yang Li,
Hui Huang,
Korin Richmond,
Jie Liu,
Zhiba Su,
Jing Guo,
Benlai Tang,
Fengjie Zhu
Abstract:
In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, a novel two-stage automatic annotation pipeline is proposed in this paper. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries respectively, while showing remarkable robustness to data scarcity.
Submitted 11 June, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
An Improved Upper Bound on the Rate-Distortion Function of Images
Authors:
Zhihao Duan,
Jack Ma,
Jiangpeng He,
Fengqing Zhu
Abstract:
Recent work has shown that Variational Autoencoders (VAEs) can be used to upper-bound the information rate-distortion (R-D) function of images, i.e., the fundamental limit of lossy image compression. In this paper, we report an improved upper bound on the R-D function of images implemented by (1) introducing a new VAE model architecture, (2) applying variable-rate compression techniques, and (3) proposing a novel \ourfunction{} to stabilize training. We demonstrate that at least 30% BD-rate reduction w.r.t. the intra prediction mode in VVC codec is achievable, suggesting that there is still great potential for improving lossy image compression. Code is made publicly available at https://github.com/duanzhiihao/lossy-vae.
Submitted 5 September, 2023;
originally announced September 2023.
-
A Visual Quality Assessment Method for Raster Images in Scanned Document
Authors:
Justin Yang,
Peter Bauer,
Todd Harris,
Changhyung Lee,
Hyeon Seok Seo,
Jan P Allebach,
Fengqing Zhu
Abstract:
Image quality assessment (IQA) is an active research area in the field of image processing. Most prior works focus on visual quality of natural images captured by cameras. In this paper, we explore visual quality of scanned documents, focusing on raster image areas. Different from many existing works which aim to estimate a visual quality score, we propose a machine learning based classification method to determine whether the visual quality of a scanned raster image at a given resolution setting is acceptable. We conduct a psychophysical study to determine the acceptability at different image resolutions based on human subject ratings and use them as the ground truth to train our machine learning model. However, this dataset is unbalanced as most images were rated as visually acceptable. To address the data imbalance problem, we introduce several noise models to simulate the degradation of image quality during the scanning process. Our results show that by including augmented data in training, we can significantly improve the performance of the classifier to determine whether the visual quality of raster images in a scanned document is acceptable or not for a given resolution setting.
Submitted 25 July, 2023;
originally announced July 2023.
-
Efficient Gaussian Process Classification-based Physical-Layer Authentication with Configurable Fingerprints for 6G-Enabled IoT
Authors:
Rui Meng,
Fangzhou Zhu,
Xiqi Cheng,
Xiaodong Xu,
Bizhu Wang,
Chen Dong,
Bingxuan Xu,
Xiaofeng Tao,
Ping Zhang
Abstract:
The future 6G-enabled IoT will facilitate seamless global connectivity among ubiquitous wireless devices, but this advancement also introduces heightened security risks such as spoofing attacks. Physical-Layer Authentication (PLA) has emerged as a promising, inherently secure, and energy-efficient technique for authenticating IoT terminals. Nonetheless, the direct application of state-of-the-art PLA schemes to 6G-enabled IoT encounters two major hurdles: inaccurate channel fingerprints and the inefficient utilization of prior fingerprint information. To tackle these challenges, we leverage Reconfigurable Intelligent Surfaces (RISs) to enhance fingerprint accuracy. Additionally, we integrate active learning and Gaussian Processes (GPs) to propose an Efficient Gaussian Process Classification (EGPC)-based PLA scheme, aiming for reliable and lightweight authentication. Following Bayes' theorem, we model configurable fingerprints using GPs and employ the expectation propagation method to identify unknown fingerprints. Given the difficulty of obtaining sufficient labeled fingerprint samples to train PLA models, we propose three fingerprint selection algorithms. These algorithms select unlabeled fingerprints and query their identities using upper-layer authentication mechanisms. Among these methods, the optimal algorithm reduces the number of training fingerprints needed through importance sampling and eliminates the requirement for PLA model retraining through joint distribution calculation. Simulation results reveal that, in comparison with non-RIS-based approaches, the RIS-aided PLA framework decreases the authentication error rate by 98.69%. In addition, our designed fingerprint selection algorithms achieve a reduction in the authentication error rate of up to 86.93% compared to baseline active learning schemes.
Submitted 5 April, 2025; v1 submitted 23 July, 2023;
originally announced July 2023.
-
MLA-BIN: Model-level Attention and Batch-instance Style Normalization for Domain Generalization of Federated Learning on Medical Image Segmentation
Authors:
Fubao Zhu,
Yanhui Tian,
Chuang Han,
Yanting Li,
Jiaofen Nan,
Ni Yao,
Weihua Zhou
Abstract:
The privacy protection mechanism of federated learning (FL) offers an effective solution for cross-center medical collaboration and data sharing. In multi-site medical image segmentation, each medical site serves as a client of FL, and its data naturally forms a domain. FL offers the possibility of improving the model's performance on seen domains. However, there is a problem of domain generalization (DG) in actual deployment, that is, the performance of the model trained by FL decreases in unseen domains. Hence, MLA-BIN is proposed to solve the DG of FL in this study. Specifically, the model-level attention module (MLA) and batch-instance style normalization (BIN) block were designed. The MLA represents the unseen domain as a linear combination of seen domain models. The attention mechanism is introduced for the weighting coefficient to obtain the optimal coefficient according to the similarity of inter-domain data features. MLA enables the global model to generalize to unseen domains. In the BIN block, batch normalization (BN) and instance normalization (IN) are combined in the shallow layers of the segmentation network to perform style normalization, reducing the influence of inter-domain image style differences on DG. Extensive experimental results on two medical image segmentation tasks demonstrate that the proposed MLA-BIN outperforms state-of-the-art methods.
Submitted 29 June, 2023;
originally announced June 2023.
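The model-level attention idea in the entry above (an unseen domain represented as a similarity-weighted linear combination of seen-domain models) can be illustrated with a small sketch. This is not the authors' implementation: the use of cosine similarity, the softmax normalization, the flat parameter lists, and all function names are assumptions made only for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax, so the weights are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(unseen_feature: Sequence[float],
                      seen_features: Sequence[Sequence[float]]) -> List[float]:
    """Assumed scoring rule: softmax over cosine similarity to each seen domain."""
    return softmax([cosine_similarity(unseen_feature, f) for f in seen_features])


def combine_models(weights: Sequence[float],
                   seen_params: Sequence[Sequence[float]]) -> List[float]:
    """Linear combination of per-domain parameter vectors."""
    size = len(seen_params[0])
    return [sum(w * p[i] for w, p in zip(weights, seen_params)) for i in range(size)]


# Hypothetical toy data: two seen domains with 2-D features and 2 parameters each.
seen_features = [[1.0, 0.0], [0.0, 1.0]]
seen_params = [[2.0, 4.0], [6.0, 8.0]]
weights = attention_weights([1.0, 0.0], seen_features)
combined = combine_models(weights, seen_params)
```

In this toy case the unseen feature matches the first seen domain, so that domain's model receives the larger weight and the combined parameters lie closer to it.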
-
TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection
Authors:
Jie Liu,
Zhiba Su,
Hui Huang,
Caiyan Wan,
Quanxiu Wang,
Jiangli Hong,
Benlai Tang,
Fengjie Zhu
Abstract:
Thanks to recent advancements in end-to-end speech modeling technology, it has become increasingly feasible to imitate and clone a user's voice. This leads to a significant challenge in differentiating between authentic and fabricated audio segments. To address the issue of user voice abuse and misuse, the second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and analyze deepfake speech utterances. Specifically, Track 2, named the Manipulation Region Location (RL), aims to pinpoint the location of manipulated regions in audio, which can be present in both real and generated audio segments. We propose our novel TranssionADD system as a solution to the challenging problem of model robustness and audio segment outliers in the trace competition. Our system provides three unique contributions: 1) we adapt the sequence tagging task for audio deepfake detection; 2) we improve model generalization with various data augmentation techniques; 3) we incorporate a multi-frame detection (MFD) module to overcome the limited representation provided by a single frame and use an isolated-frame penalty (IFP) loss to handle outliers in segments. Our best submission achieved 2nd place in Track 2, demonstrating the effectiveness and robustness of our proposed system.
Submitted 27 June, 2023;
originally announced June 2023.
-
Self-Supervised Visual Representation Learning on Food Images
Authors:
Andrew Peng,
Jiangpeng He,
Fengqing Zhu
Abstract:
Food image analysis is the groundwork for image-based dietary assessment, which is the process of monitoring what kinds of food and how much energy is consumed using captured food or eating scene images. Existing deep learning-based methods learn the visual representation for downstream tasks based on human annotation of each food image. However, most food images in real life are obtained without labels, and data annotation requires plenty of time and human effort, which is not feasible for real-world applications. To make use of the vast amount of unlabeled images, many existing works focus on unsupervised or self-supervised learning of visual representations directly from unlabeled data. However, none of these existing works focus on food images, which are more challenging than general objects due to their high inter-class similarity and intra-class variance.
In this paper, we focus on the implementation and analysis of existing representative self-supervised learning methods on food images. Specifically, we first compare the performance of six selected self-supervised learning models on the Food-101 dataset. Then we analyze the pros and cons of each selected model when training on food data to identify the key factors that can help improve the performance. Finally, we propose several ideas for future work on self-supervised visual representation learning for food images.
Submitted 15 March, 2023;
originally announced March 2023.
-
Nonlinear Hyperspectral Unmixing based on Multilinear Mixing Model using Convolutional Autoencoders
Authors:
Tingting Fang,
Fei Zhu,
Jie Chen
Abstract:
Unsupervised spectral unmixing consists of representing each observed pixel as a combination of several pure materials called endmembers with their corresponding abundance fractions. Beyond the linear assumption, various nonlinear unmixing models have been proposed, with the associated optimization problems solved either by traditional optimization algorithms or deep learning techniques. Current deep learning-based nonlinear unmixing focuses on models in additive, bilinear-based formulations. By interpreting the reflection process using a discrete Markov chain, the multilinear mixing model (MLM) successfully accounts for interactions between endmembers of up to infinite order. However, explicitly simulating the physical process of MLM with neural networks is a challenging problem that has not been addressed so far. In this article, we propose a novel autoencoder-based network for unsupervised unmixing based on MLM. Benefitting from an elaborate network design, the relationships among all the model parameters, i.e., endmembers, abundances, and transition probability parameters, are explicitly modeled. There are two modes: MLM-1DAE considers only pixel-wise spectral information, and MLM-3DAE exploits the spectral-spatial correlations within input patches. Experiments on both synthetic and real datasets demonstrate the effectiveness of the proposed method, as it achieves performance competitive with the classic solutions of MLM.
Submitted 14 March, 2023;
originally announced March 2023.
-
QARV: Quantization-Aware ResNet VAE for Lossy Image Compression
Authors:
Zhihao Duan,
Ming Lu,
Jack Ma,
Yuning Huang,
Zhan Ma,
Fengqing Zhu
Abstract:
This paper addresses the problem of lossy image compression, a fundamental problem in image processing and information theory that is involved in many real-world applications. We start by reviewing the framework of variational autoencoders (VAEs), a powerful class of generative probabilistic models that has a deep connection to lossy compression. Based on VAEs, we develop a novel scheme for lossy image compression, which we name quantization-aware ResNet VAE (QARV). Our method incorporates a hierarchical VAE architecture integrated with test-time quantization and quantization-aware training, without which efficient entropy coding would not be possible. In addition, we design the neural network architecture of QARV specifically for fast decoding and propose an adaptive normalization operation for variable-rate compression. Extensive experiments are conducted, and results show that QARV achieves variable-rate compression, high-speed decoding, and a better rate-distortion performance than existing baseline methods. The code of our method is publicly accessible at https://github.com/duanzhiihao/lossy-vae
Submitted 1 December, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Incremental Value and Interpretability of Radiomics Features of Both Lung and Epicardial Adipose Tissue for Detecting the Severity of COVID-19 Infection
Authors:
Ni Yao,
Yanhui Tian,
Daniel Gama das Neves,
Chen Zhao,
Claudio Tinoco Mesquita,
Wolney de Andrade Martins,
Alair Augusto Sarmet Moreira Damas dos Santos,
Yanting Li,
Chuang Han,
Fubao Zhu,
Neng Dai,
Weihua Zhou
Abstract:
Epicardial adipose tissue (EAT) is known for its pro-inflammatory properties and association with Coronavirus Disease 2019 (COVID-19) severity. However, current EAT segmentation methods do not consider positional information. Additionally, the detection of COVID-19 severity lacks consideration for EAT radiomics features, which limits interpretability. This study investigates the use of radiomics features from EAT and lungs to detect the severity of COVID-19 infections. A retrospective analysis of 515 patients with COVID-19 (Cohort1: 415, Cohort2: 100) was conducted using a proposed three-stage deep learning approach for EAT extraction. Lung segmentation was achieved using a published method. A hybrid model for detecting the severity of COVID-19 was built in a derivation cohort, and its performance and uncertainty were evaluated in internal (125, Cohort1) and external (100, Cohort2) validation cohorts. For EAT extraction, the Dice similarity coefficients (DSC) of the two centers were 0.972 (±0.011) and 0.968 (±0.005), respectively. For severity detection, the hybrid model with radiomics features of both lungs and EAT showed improvements in AUC, net reclassification improvement (NRI), and integrated discrimination improvement (IDI) compared to the model with only lung radiomics features. The hybrid model exhibited increases of 0.1 (p<0.001), 19.3%, and 18.0%, respectively, in the internal validation cohort and increases of 0.09 (p<0.001), 18.0%, and 18.0%, respectively, in the external validation cohort, while outperforming existing detection methods. Uncertainty quantification and radiomics features analysis confirmed the interpretability of case prediction after inclusion of EAT features.
Submitted 6 December, 2023; v1 submitted 28 January, 2023;
originally announced January 2023.