Search | arXiv e-print repository

4KAgent: Agentic Any Image to 4K Super-Resolution

Authors: Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu

Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics.
By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.

Submitted 9 July, 2025; originally announced July 2025.

Comments: Project page: https://4kagent.github.io

arXiv:2507.02289 [pdf, ps, other]

CineMyoPS: Segmenting Myocardial Pathologies from Cine Cardiac MR

Authors: Wangbin Ding, Lei Li, Junyi Qiu, Bogen Lin, Mingjing Yang, Liqin Huang, Lianming Wu, Sihan Wang, Xiahai Zhuang

Abstract: Myocardial infarction (MI) is a leading cause of death worldwide. Late gadolinium enhancement (LGE) and T2-weighted cardiac magnetic resonance (CMR) imaging can respectively identify scarring and edema areas, both of which are essential for MI risk stratification and prognosis assessment. Although combining complementary information from multi-sequence CMR is useful, acquiring these sequences can be time-consuming and prohibitive, e.g., due to the administration of contrast agents. Cine CMR is a rapid and contrast-free imaging technique that can visualize both motion and structural abnormalities of the myocardium induced by acute MI. Therefore, we present a new end-to-end deep neural network, referred to as CineMyoPS, to segment myocardial pathologies, i.e., scars and edema, solely from cine CMR images. Specifically, CineMyoPS extracts both motion and anatomy features associated with MI. Given the interdependence between these features, we design a consistency loss (resembling the co-training strategy) to facilitate their joint learning. Furthermore, we propose a time-series aggregation strategy to integrate MI-related features across the cardiac cycle, thereby enhancing segmentation accuracy for myocardial pathologies. Experimental results on a multi-center dataset demonstrate that CineMyoPS achieves promising performance in myocardial pathology segmentation, motion estimation, and anatomy segmentation.

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2506.17425 [pdf, ps, other]

Trans${^2}$-CBCT: A Dual-Transformer Framework for Sparse-View CBCT Reconstruction

Authors: Minmin Yang, Huantao Ren, Senem Velipasalar

Abstract: Cone-beam computed tomography (CBCT) using only a few X-ray projection views enables faster scans with lower radiation dose, but the resulting severe under-sampling causes strong artifacts and poor spatial coverage. We address these challenges in a unified framework. First, we replace conventional UNet/ResNet encoders with TransUNet, a hybrid CNN-Transformer model. Convolutional layers capture local details, while self-attention layers enhance global context. We adapt TransUNet to CBCT by combining multi-scale features, querying view-specific features per 3D point, and adding a lightweight attenuation-prediction head. This yields Trans-CBCT, which surpasses prior baselines by 1.17 dB PSNR and 0.0163 SSIM on the LUNA16 dataset with six views. Second, we introduce a neighbor-aware Point Transformer to enforce volumetric coherence. This module uses 3D positional encoding and attention over k-nearest neighbors to improve spatial consistency. The resulting model, Trans$^2$-CBCT, provides an additional gain of 0.63 dB PSNR and 0.0117 SSIM. Experiments on LUNA16 and ToothFairy show consistent gains from six to ten views, validating the effectiveness of combining CNN-Transformer features with point-based geometry reasoning for sparse-view CBCT reconstruction.

Submitted 20 June, 2025; originally announced June 2025.
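
The PSNR gains quoted in the abstract above are easier to interpret with the standard definition in hand. The following is a minimal sketch of that textbook formula, not code from the paper; the function names `psnr` and `mse_ratio_for_gain` and the flat-list image representation are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat
    sequences of intensities (hypothetical helper, standard definition)."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```

Under this definition, a gain of 1.17 dB corresponds to the mean squared error shrinking by a factor of roughly 1.31, assuming the same peak value is used for both methods.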

arXiv:2506.16961 [pdf, ps, other]

Reversing Flow for Image Restoration

Authors: Haina Qin, Wenyang Luo, Libin Wang, Dandan Zheng, Jingdong Chen, Ming Yang, Bing Li, Weiming Hu

Abstract: Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restoration framework that models the degradation process as a deterministic path using continuous normalizing flows. ResFlow augments the degradation process with an auxiliary process that disambiguates the uncertainty in HQ prediction to enable reversible modeling of the degradation process. ResFlow adopts entropy-preserving flow paths and learns the augmented degradation flow by matching the velocity field. ResFlow significantly improves the performance and speed of image restoration, completing the task in fewer than four sampling steps. Extensive experiments demonstrate that ResFlow achieves state-of-the-art results across various image restoration benchmarks, offering a practical and efficient solution for real-world applications.

Submitted 20 June, 2025; originally announced June 2025.

Comments: CVPR2025 Final Version; Corresponding Author: Bing Li

MSC Class: 68U10 ACM Class: I.4.4

arXiv:2506.05706 [pdf, other]

Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition

Authors: Mu Yang, Szu-Jui Chen, Jiamin Xie, John Hansen

Abstract: One challenge of integrating speech input with large language models (LLMs) stems from the discrepancy between the continuous nature of audio data and the discrete token-based paradigm of LLMs. To mitigate this gap, we propose a method for integrating vector quantization (VQ) into LLM-based automatic speech recognition (ASR). Using the LLM embedding table as the VQ codebook, the VQ module aligns the continuous representations from the audio encoder with the discrete LLM inputs, enabling the LLM to operate on a discretized audio representation that better reflects the linguistic structure. We further create a soft "discretization" of the audio representation by updating the codebook and performing a weighted sum over the codebook embeddings. Empirical results demonstrate that our proposed method significantly improves upon the LLM-based ASR baseline, particularly in out-of-domain conditions. This work highlights the potential of soft discretization as a modality bridge in LLM-based ASR.

Submitted 5 June, 2025; originally announced June 2025.
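
The "weighted sum over the codebook embeddings" mentioned in the abstract above can be illustrated with a small sketch. The abstract does not say how the weights are computed, so the softmax over negative squared distances, the `temperature` parameter, and the function name `soft_discretize` are all assumptions made here for illustration, not the authors' method.

```python
import math


def soft_discretize(frame, codebook, temperature=1.0):
    """Map one continuous frame vector to a soft mixture of codebook rows.

    Assumption: weights are a softmax over negative squared Euclidean
    distances; the paper may use a different similarity or weighting.
    Returns (weights, soft_embedding).
    """
    # Squared distance from the frame to every codebook entry.
    dists = [sum((f - c) ** 2 for f, c in zip(frame, entry)) for entry in codebook]
    # Softmax over negative distances; subtract the max for stability.
    logits = [-d / temperature for d in dists]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of codebook rows replaces the hard nearest-neighbor pick.
    soft = [
        sum(w * entry[i] for w, entry in zip(weights, codebook))
        for i in range(len(frame))
    ]
    return weights, soft
```

With a small `temperature` the weights concentrate on the nearest codebook entry, which approaches hard vector quantization; larger values blend several entries.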

arXiv:2505.24651 [pdf, ps, other]

Robust Distributed Phase Retrieval for Multi-View Compressive Networked Sensing With Outliers

Authors: Ming-Hsun Yang

Abstract: This work examines the multi-view compressive phase retrieval problem in a distributed sensor network, where each sensor device, limited by storage and sensing capabilities, can access only intensity measurements from an unknown part of the global sparse vector. The goal is to enable each sensor to recover its observable sparse signal when measurements are corrupted by outliers. To achieve reliable local signal recovery with limited data access, we propose a distributed reconstruction algorithm that enables collaboration among sensor devices without the need to share individual raw data. The proposed scheme employs a two-stage approach that first recovers the amplitude of the global signal (at a central server) and subsequently estimates the observable nonzero signal entries (at each local device). Our analytic results show that perfect global signal amplitude recovery can be achieved under mild conditions on the support size of sparse outliers and the view blockage level. In addition, the exact reconstruction of locally observed signal components is shown to be attainable in the noise-free case by solving a binary optimization problem, subject to a mild requirement on the structure of the sensing matrix. Computer simulations are provided to illustrate the effectiveness of the proposed scheme.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.16027 [pdf]

Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets

Authors: Qinmei Xu, Yiheng Li, Xianghao Zhan, Ahmet Gorkem Er, Brittany Dashevsky, Chuanjun Xu, Mohammed Alawad, Mengya Yang, Liu Ya, Changsheng Zhou, Xiao Li, Haruka Itakura, Olivier Gevaert

Abstract: Foundation models leveraging vision-language pretraining have shown promise in chest X-ray (CXR) interpretation, yet their real-world performance across diverse populations and diagnostic tasks remains insufficiently evaluated. This study benchmarks the diagnostic performance and generalizability of foundation models versus traditional convolutional neural networks (CNNs) on multinational CXR datasets. We evaluated eight CXR diagnostic models - five vision-language foundation models and three CNN-based architectures - across 37 standardized classification tasks using six public datasets from the USA, Spain, India, and Vietnam, and three private datasets from hospitals in China. Performance was assessed using AUROC, AUPRC, and other metrics across both shared and dataset-specific tasks. Foundation models outperformed CNNs in both accuracy and task coverage. MAVL, a model incorporating knowledge-enhanced prompts and structured supervision, achieved the highest performance on public (mean AUROC: 0.82; AUPRC: 0.32) and private (mean AUROC: 0.95; AUPRC: 0.89) datasets, ranking first in 14 of 37 public and 3 of 4 private tasks. All models showed reduced performance on pediatric cases, with average AUROC dropping from 0.88 +/- 0.18 in adults to 0.57 +/- 0.29 in children (p = 0.0202). These findings highlight the value of structured supervision and prompt design in radiologic AI and suggest future directions including geographic expansion and ensemble modeling for clinical deployment.
Code for all evaluated models is available at https://drive.google.com/drive/folders/1B99yMQm7bB4h1sVMIBja0RfUu8gLktCE

Submitted 21 May, 2025; originally announced May 2025.

Comments: 78 pages, 7 figures, 2 tables

MSC Class: I.2 ACM Class: I.2

arXiv:2505.07916 [pdf, ps, other]

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Authors: Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He

Abstract: We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2504.18022 [pdf, ps, other]

Iterative Joint Detection of Kalman Filter and Channel Decoder for Sensor-to-Controller Link in Wireless Networked Control Systems

Authors: Jinnan Piao, Dong Li, Yiming Sun, Zhibo Li, Ming Yang, Xueting Yu

Abstract: In this letter, we propose an iterative joint detection algorithm of Kalman filter (KF) and channel decoder for the sensor-to-controller link of wireless networked control systems, which utilizes the prior information of the control system to improve control and communication performance. In this algorithm, we first use the KF to estimate the probability density of the control system outputs and calculate the prior probability of received signals to assist the decoder. Then, the possible outputs of the control system are traversed to update the prior probability in order to implement iterative detection. The simulation results show that the prior information and the iterative structure can reduce the block error rate of communications while improving the root mean square error performance of control.

Submitted 29 May, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

Comments: 5 pages, 4 figures

arXiv:2503.13468 [pdf, other]

A CGAN-LSTM-Based Framework for Time-Varying Non-Stationary Channel Modeling

Authors: Keying Guo, Ruisi He, Mi Yang, Yuxin Zhang, Bo Ai, Haoxiang Zhang, Jiahui Han, Ruifeng Chen

Abstract: Time-varying non-stationary channels, with complex dynamic variations and temporal evolution characteristics, pose significant challenges for channel modeling and communication system performance evaluation. Most existing methods of time-varying channel modeling focus on predicting the channel state at a given moment or simulating short-term channel fluctuations, and are unable to capture the long-term evolution of the channel. This paper emphasizes the generation of long-term dynamic channels to fully capture the evolution of non-stationary channel properties. The generated channel not only reflects temporal dynamics but also ensures consistent stationarity. We propose a hybrid deep learning framework that combines conditional generative adversarial networks (CGAN) with long short-term memory (LSTM) networks. A stationarity-constrained approach is designed to ensure temporal correlation of the generated time-series channel. This method can generate channels with the required temporal non-stationarity. The model is validated by comparing channel statistical features, and the results show that the generated channel is in good agreement with the raw channel and provides good performance in terms of non-stationarity.

Submitted 2 March, 2025; originally announced March 2025.

Comments: 11 pages, 7 figures

arXiv:2503.11124 [pdf, other]

Flow-Aware Navigation of Magnetic Micro-Robots in Complex Fluids via PINN-Based Prediction

Authors: Yongyi Jia, Shu Miao, Jiayu Wu, Ming Yang, Chengzhi Hu, Xiang Li

Abstract: While magnetic micro-robots have demonstrated significant potential across various applications, including drug delivery and microsurgery, the open issue of precise navigation and control in complex fluid environments is crucial for in vivo implementation. This paper introduces a novel flow-aware navigation and control strategy for magnetic micro-robots that explicitly accounts for the impact of fluid flow on their movement. First, the proposed method employs a Physics-Informed U-Net (PI-UNet) to refine the numerically predicted fluid velocity using local observations. Then, the predicted velocity is incorporated in a flow-aware A* path planning algorithm, ensuring efficient navigation while mitigating flow-induced disturbances. Finally, a control scheme is developed to compensate for the predicted fluid velocity, thereby optimizing the micro-robot's performance. A series of simulation studies and real-world experiments are conducted to validate the efficacy of the proposed approach. This method enhances both planning accuracy and control precision, expanding the potential applications of magnetic micro-robots in fluid-affected environments typical of many medical scenarios.

Submitted 14 March, 2025; originally announced March 2025.

Comments: 8
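
The flow-aware A* planning step described in the abstract above can be sketched on a small grid. The abstract does not give the cost function, so the "headwind" penalty below, the `alpha` weight, the grid representation, and the name `flow_aware_astar` are hypothetical choices for illustration only; the flow field is supplied by hand rather than predicted by a PI-UNet.

```python
import heapq


def flow_aware_astar(rows, cols, flow, start, goal, alpha=1.0):
    """A* on a 4-connected grid where moving against the flow costs extra.

    flow maps a cell (row, col) to a velocity (d_row, d_col); missing cells
    have zero flow. Assumed step cost: 1 + alpha * headwind, where headwind
    is the flow component opposing the move into the entered cell.
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    def heuristic(cell):
        # Manhattan distance is admissible because every step costs >= 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cost > best[cell]:
            continue  # stale queue entry
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1], cost
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + d_row, cell[1] + d_col)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            f_row, f_col = flow.get(nxt, (0.0, 0.0))
            headwind = max(0.0, -(d_row * f_row + d_col * f_col))
            new_cost = cost + 1.0 + alpha * headwind
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None, float("inf")
```

For example, on a 3x3 grid with a strong leftward flow in the center cell, a planner going from the left edge to the right edge of the middle row avoids entering the center cell head-on and accepts a longer route instead.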

arXiv:2503.01383 [pdf, other]

Channel Semantic Characterization for Integrated Sensing and Communication Scenarios: From Measurements to Modeling

Authors: Zhengyu Zhang, Ruisi He, Bo Ai, Mi Yang, Xuejian Zhang, Ziyi Qi, Zhangdui Zhong

Abstract: With the advancement of sixth-generation (6G) wireless communication systems, integrated sensing and communication (ISAC) is crucial for perceiving and interacting with the environment via electromagnetic propagation, termed channel semantics, to support tasks like decision-making. However, channel models focusing on physical characteristics face challenges in representing semantics embedded in the channel, thereby limiting the evaluation of ISAC systems. To tackle this, we present a novel framework for channel modeling from the conceptual event perspective. By leveraging a multi-level semantic structure and characterized knowledge libraries, the framework decomposes complex channel characteristics into extensible semantic characterization, thereby better capturing the relationship between environment and channel, and enabling more flexible adjustments of channel models for different events without requiring a complete reset. Specifically, we define channel semantics on three levels: status semantics, behavior semantics, and event semantics, corresponding to channel multipaths, channel time-varying trajectories, and channel topology, respectively. Taking realistic vehicular ISAC scenarios as an example, we perform semantic clustering, characterizing status semantics via multipath statistical distributions, modeling behavior semantics using Markov chains for time variation, and representing event semantics through a co-occurrence matrix. Results show the model accurately generates channels while capturing rich semantic information.
Moreover, its generalization supports customized semantics.

Submitted 3 March, 2025; originally announced March 2025.

arXiv:2502.18846 [pdf, other]

RL-OGM-Parking: Lidar OGM-Based Hybrid Reinforcement Learning Planner for Autonomous Parking

Authors: Zhitao Wang, Zhe Chen, Mingyang Jiang, Tong Qin, Ming Yang

Abstract: Autonomous parking has become a critical application in automatic driving research and development. Parking operations often suffer from limited space and complex environments, requiring accurate perception and precise maneuvering. Traditional rule-based parking algorithms struggle to adapt to diverse and unpredictable conditions, while learning-based algorithms lack consistent and stable performance in various scenarios. Therefore, a hybrid approach is necessary that combines the stability of rule-based methods and the generalizability of learning-based methods. Recently, reinforcement learning (RL) based policies have shown robust capability in planning tasks. However, the simulation-to-reality (sim-to-real) transfer gap seriously blocks real-world deployment. To address these problems, we employ a hybrid policy, consisting of a rule-based Reeds-Shepp (RS) planner and a learning-based reinforcement learning (RL) planner. A real-time LiDAR-based Occupancy Grid Map (OGM) representation is adopted to bridge the sim-to-real gap, so that the hybrid policy can be applied to real-world systems seamlessly. We conducted extensive experiments both in the simulation environment and real-world scenarios, and the results demonstrate that the proposed method outperforms pure rule-based and learning-based methods. The real-world experiment further validates the feasibility and efficiency of the proposed method.

Submitted 26 February, 2025; originally announced February 2025.

arXiv:2502.16342 [pdf, other]

Revealing Microscopic Objects in Fluorescence Live Imaging by Video-to-video Translation Based on A Spatial-temporal Generative Adversarial Network

Authors: Yang Jiao, Mei Yang, Mo Weng

Abstract: In spite of being a valuable tool to simultaneously visualize multiple types of subcellular structures using spectrally distinct fluorescent labels, a standard fluorescence microscope is only able to identify a few microscopic objects; such a limit is largely imposed by the number of fluorescent labels available to the sample. In order to simultaneously visualize more objects, in this paper, we propose to use video-to-video translation that mimics the development process of microscopic objects. In essence, we use a microscopy video-to-video translation framework, namely Spatial-temporal Generative Adversarial Network (STGAN), to reveal the spatial and temporal relationships between the microscopic objects, after which a microscopy video of one object can be translated to another object in a different domain. The experimental results confirm that the proposed STGAN is effective in microscopy video-to-video translation that mitigates the spectral conflicts caused by the limited fluorescent labels, allowing multiple microscopic objects to be simultaneously visualized.

Submitted 22 February, 2025; originally announced February 2025.

arXiv:2502.09654 [pdf, other]

Heterogeneous Mixture of Experts for Remote Sensing Image Super-Resolution

Authors: Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi

Abstract: Remote sensing image super-resolution (SR) aims to reconstruct high-resolution remote sensing images from low-resolution inputs, thereby addressing limitations imposed by sensors and imaging conditions. However, the inherent characteristics of remote sensing images, including diverse ground object types and complex details, pose significant challenges to achieving high-quality reconstruction. Existing methods typically employ a uniform structure to process various types of ground objects without distinction, making it difficult to adapt to the complex characteristics of remote sensing images. To address this issue, we introduce a Mixture of Experts (MoE) model and design a set of heterogeneous experts. These experts are organized into multiple expert groups, where experts within each group are homogeneous while being heterogeneous across groups. This design ensures that specialized activation parameters can be employed to handle the diverse and intricate details of ground objects effectively. To better accommodate the heterogeneous experts, we propose a multi-level feature aggregation strategy to guide the routing process. Additionally, we develop a dual-routing mechanism to adaptively select the optimal expert for each pixel. Experiments conducted on the UCMerced and AID datasets demonstrate that our proposed method achieves superior SR reconstruction accuracy compared to state-of-the-art methods. The code will be available at https://github.com/Mr-Bamboo/MFG-HMoE.

Submitted 2 April, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

arXiv:2502.00800 [pdf, other]

Adversarial Semantic Augmentation for Training Generative Adversarial Networks under Limited Data

Authors: Mengping Yang, Zhe Wang, Ziqiu Chi, Dongdong Li, Wenli Du

Abstract: Generative adversarial networks (GANs) have made remarkable achievements in synthesizing images in recent years. Typically, training GANs requires massive data, and the performance of GANs deteriorates significantly when training data is limited. To improve the synthesis performance of GANs in low-data regimes, existing approaches use various data augmentation techniques to enlarge the training sets. However, it is identified that these augmentation techniques may leak or even alter the data distribution. To remedy this, we propose an adversarial semantic augmentation (ASA) technique to enlarge the training data at the semantic level instead of the image level. Concretely, considering that semantic features usually encode rich information about images, we estimate the covariance matrices of semantic features for both real and generated images to find meaningful transformation directions. Such directions translate original features to another semantic representation, e.g., changing the backgrounds or expressions of the human face dataset. Moreover, we derive an upper bound of the expected adversarial loss. By optimizing the upper bound, our semantic augmentation is implicitly achieved. Such a design avoids redundant sampling of the augmented features and introduces negligible computation overhead, making our approach computationally efficient.
Extensive experiments on both few-shot and large-scale datasets demonstrate that our method consistently improves the synthesis quality under various data regimes, and further visualized and analytic results suggest satisfactory versatility of our proposed method.

Submitted 2 February, 2025; originally announced February 2025.

Comments: This work was completed in 2022 and submitted to an IEEE journal for potential publication

arXiv:2501.15726 [pdf, other]

Vision-Aided Channel Prediction Based on Image Segmentation at Street Intersection Scenarios

Authors: Xuejian Zhang, Ruisi He, Mi Yang, Ziyi Qi, Zhengyu Zhang, Bo Ai, Zhangdui Zhong

Abstract: Intelligent vehicular communication with vehicle road collaboration capability is a key technology enabled by 6G, and the integration of various visual sensors on vehicles and infrastructures plays a crucial role. Moreover, accurate channel prediction is foundational to realizing intelligent vehicular communication. Traditional methods are still limited by the inability to balance accuracy and operability based on substantial spectrum resource consumption and highly refined description of the environment. Therefore, leveraging out-of-band information introduced by visual sensors provides a new solution and is increasingly applied across various communication tasks. In this paper, we propose a computer vision (CV)-based prediction model for vehicular communications, realizing accurate channel characterization prediction including path loss, Rice K-factor and delay spread based on image segmentation. First, we conduct extensive vehicle-to-infrastructure measurement campaigns, collecting channel and visual data from various street intersection scenarios. The image-channel dataset is generated after a series of data post-processing steps. The image data consist of individual segmentations of the target user obtained using the YOLOv8 network. Subsequently, the established dataset is used to train and test the prediction network ResNet-32, where segmented images serve as the input of the network, and various channel characteristics are treated as labels or target outputs of the network. Finally, self-validation and cross-validation experiments are performed. The results indicate that models trained with segmented images achieve high prediction accuracy and remarkable generalization performance across different streets and target users. The model proposed in this paper offers novel solutions for achieving intelligent channel prediction in vehicular communications.

Submitted 26 January, 2025; originally announced January 2025.

Comments: 12 pages, 9 figures, submitted to IEEE Transactions on Cognitive Communications and Networking

arXiv:2501.13134 [pdf, ps, other]

UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior

Authors: I-Hsiang Chen, Wei-Ting Chen, Yu-Wei Liu, Yuan-Chun Chiang, Sy-Yen Kuo, Ming-Hsuan Yang

Abstract: Image restoration aims to recover content from inputs degraded by various factors, such as adverse weather, blur, and noise. Perceptual Image Restoration (PIR) methods improve visual quality but often do not support downstream tasks effectively. On the other hand, Task-oriented Image Restoration (TIR) methods focus on enhancing image utility for high-level vision tasks, sometimes compromising visual quality. This paper introduces UniRestore, a unified image restoration model that bridges the gap between PIR and TIR by using a diffusion prior. The diffusion prior is designed to generate images that align with human visual quality preferences, but these images are often unsuitable for TIR scenarios. To solve this limitation, UniRestore utilizes encoder features from an autoencoder to adapt the diffusion prior to specific tasks. We propose a Complementary Feature Restoration Module (CFRM) to reconstruct degraded encoder features and a Task Feature Adapter (TFA) module to facilitate adaptive feature fusion in the decoder. This design allows UniRestore to optimize images for both human perception and downstream task requirements, addressing discrepancies between visual quality and functional needs. Integrating these modules also enhances UniRestore's adaptability and efficiency across diverse tasks. Extensive experiments demonstrate the superior performance of UniRestore in both PIR and TIR scenarios.

Submitted 1 June, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

Comments: Accepted by CVPR2025 (Highlight); Project Page: https://unirestore.github.io

arXiv:2501.08639 [pdf]

Detecting Wildfire Flame and Smoke through Edge Computing using Transfer Learning Enhanced Deep Learning Models

Authors: Giovanny Vazquez, Shengjie Zhai, Mei Yang

Abstract: Autonomous unmanned aerial vehicles (UAVs) integrated with edge computing capabilities empower real-time data processing directly on the device, dramatically reducing latency in critical scenarios such as wildfire detection. This study underscores Transfer Learning's (TL) significance in boosting the performance of object detectors for identifying wildfire smoke and flames, especially when trained on limited datasets, and investigates the impact TL has on edge computing metrics. The latter focuses on how TL-enhanced You Only Look Once (YOLO) models perform in terms of inference time, power usage, and energy consumption when using edge computing devices. This study utilizes the Aerial Fire and Smoke Essential (AFSE) dataset as the target, with the Flame and Smoke Detection Dataset (FASDD) and the Microsoft Common Objects in Context (COCO) dataset serving as source datasets. We explore a two-stage cascaded TL method, utilizing D-Fire or FASDD as initial stage target datasets and AFSE as the subsequent stage. Through fine-tuning, TL significantly enhances detection precision, achieving up to 79.2% mean Average Precision (mAP@0.5), reduces training time, and increases model generalizability across the AFSE dataset. However, cascaded TL yielded no notable improvements and TL alone did not benefit the edge computing metrics evaluated. Lastly, this work found that YOLOv5n remains a powerful model when lacking hardware acceleration, finding that YOLOv5n can process images nearly twice as fast as its newer counterpart, YOLO11n. Overall, the results affirm TL's role in augmenting the accuracy of object detectors while also illustrating that additional enhancements are needed to improve edge computing performance.

Submitted 15 January, 2025; originally announced January 2025.

Comments: 11 pages, 7 figures

arXiv:2412.11393 [pdf]

STDHL: Spatio-Temporal Dynamic Hypergraph Learning for Wind Power Forecasting

Authors: Xiaochong Dong, Xuemin Zhang, Ming Yang, Shengwei Mei

Abstract: Leveraging spatio-temporal correlations among wind farms can significantly enhance the accuracy of ultra-short-term wind power forecasting. However, the complex and dynamic nature of these correlations presents significant modeling challenges. To address this, we propose a spatio-temporal dynamic hypergraph learning (STDHL) model. This model uses a hypergraph structure to represent spatial features among wind farms. Unlike traditional graph structures, which only capture pair-wise node features, hypergraphs create hyperedges connecting multiple nodes, enabling the representation and transmission of higher-order spatial features. The STDHL model incorporates a novel dynamic hypergraph convolutional layer to model dynamic spatial correlations and a grouped temporal convolutional layer for channel-independent temporal modeling. The model uses spatio-temporal encoders to extract features from multi-source covariates, which are mapped to quantile results through a forecast decoder. Experimental results using the GEFCom dataset show that the STDHL model outperforms existing state-of-the-art methods. Furthermore, an in-depth analysis highlights the critical role of spatio-temporal covariates in improving ultra-short-term forecasting accuracy.

Submitted 15 December, 2024; originally announced December 2024.

arXiv:2412.07074 [pdf, other]

Channel Spreading Function-Inspired Channel Transfer Function Estimation for OFDM Systems with High-Mobility

Authors: Yiyan Ma, Bo Ai, Guoyu Ma, Akram Shafie, Qingqing Cheng, Mi Yang, Jingli Li, Xuebo Pang, Jinhong Yuan, Zhangdui Zhong

Abstract: In this letter, we propose a novel channel transfer function (CTF) estimation approach for orthogonal frequency division multiplexing (OFDM) systems in high-mobility scenarios that leverages the stationary properties of the delay-Doppler domain channel spreading function (CSF). First, we develop a CSF estimation model for OFDM systems that relies solely on discrete pilot symbols in the time-frequency (TF) domain, positioned at predefined resource elements. We then present theorems to elucidate the relationship between CSF compactness and pilot spacing in the TF domain for accurate CSF acquisition. Based on the estimated CSF, we finally estimate the CTF for data symbols. Numerical results show that, in high-mobility scenarios, the proposed approach outperforms traditional interpolation-based methods and closely matches the optimal estimator in terms of estimation accuracy. This work may pave the way for CSF estimation in commercial OFDM systems, benefiting high-mobility communications, integrated sensing and communications, and related applications.

Submitted 9 December, 2024; originally announced December 2024.

arXiv:2411.19509 [pdf, other]

Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

Authors: Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang

Abstract: Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. Specifically, we utilize an off-the-shelf motion extractor and devise a diffusion transformer to generate representations in a specific motion space. We optimize the model architecture and training strategy to address the issues in generating motion representations, including insufficient disentanglement between motion and identity, and large internal discrepancies within the representation. Besides, we employ diverse conditional signals while establishing a mapping between motion representation and facial semantics, enabling control over the generation process and correction of the results. Moreover, we jointly optimize the holistic framework to enable streaming processing, real-time inference, and low first-frame delay, offering functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.

Submitted 30 April, 2025; v1 submitted 29 November, 2024; originally announced November 2024.

Comments: Project Page: https://digital-avatar.github.io/ai/Ditto/

arXiv:2411.11798 [pdf]

COST CA20120 INTERACT Framework of Artificial Intelligence Based Channel Modeling

Authors: Ruisi He, Nicola D. Cicco, Bo Ai, Mi Yang, Yang Miao, Mate Boban

Abstract: Accurate channel models are the prerequisite for communication-theoretic investigations as well as system design. Channel modeling generally relies on statistical and deterministic approaches. However, there are still significant limits for the traditional modeling methods in terms of accuracy, generalization ability, and computational complexity. The fundamental reason is that establishing a quantified and accurate mapping between physical environment and channel characteristics becomes increasingly challenging for modern communication systems. Here, in the context of COST CA20120 Action, we evaluate and discuss the feasibility and implementation of using artificial intelligence (AI) for channel modeling, and explore where the future of this field lies. Firstly, we present a framework of AI-based channel modeling to characterize complex wireless channels. Then, we highlight in detail some major challenges and present the possible solutions: i) estimating the uncertainty of AI-based channel predictions, ii) integrating prior knowledge of propagation to improve generalization capabilities, and iii) interpretable AI for channel modeling. We present and discuss illustrative numerical results to showcase the capabilities of AI-based channel modeling.

Submitted 31 October, 2024; originally announced November 2024.

Comments: to appear in IEEE Wireless Communications Magazine

arXiv:2411.11539 [pdf, ps, other]

Channel Capacity-Aware Distributed Encoding for Multi-View Sensing and Edge Inference

Authors: Mingjie Yang, Guangming Liang, Dongzhu Liu, Lei Zhang, Kaibin Huang

Abstract: Integrated sensing and communication (ISAC) unifies wireless communication and sensing by sharing spectrum and hardware, which often incurs trade-offs between two functions due to limited resources. However, this paper shifts focus to exploring the synergy between communication and sensing, using WiFi sensing as an exemplary scenario where communication signals are repurposed to probe the environment without dedicated sensing waveforms, followed by data uploading to the edge server for inference. While increased device participation enhances multi-view sensing data, it also imposes significant communication overhead between devices and the edge server. To address this challenge, we aim to maximize the sensing task performance, measured by mutual information, under the channel capacity constraint. The information-theoretic optimization problem is solved by the proposed ADE-MI, a novel framework that employs a two-stage optimization approach: (1) adaptive distributed encoding (ADE) at the device, which ensures transmitted bits are most relevant to sensing tasks, and (2) multi-view inference (MI) at the edge server, which orchestrates multi-view data from distributed devices. Our experimental results highlight the synergy between communication and sensing, showing that more frequent communication from WiFi access points to edge devices improves sensing inference accuracy. The proposed ADE-MI achieves 92% recognition accuracy with over $10^4$-fold reduction in latency compared to schemes with raw data communication, achieving both high sensing inference accuracy and low communication latency simultaneously.

Submitted 18 November, 2024; originally announced November 2024.

arXiv:2411.05835 [pdf, other]

Improved Convolution-Based Analysis for Worst-Case Probability Response Time of CAN

Authors: Haozhe Yi, Junyi Liu, Maolin Yang, Zewei Chen, Xu Jiang

Abstract: Controller Area Networks (CANs) are widely adopted in real-time automotive control and are increasingly standard in factory automation. Considering their critical application in safety-critical systems, the error rate of the system must be accurately predicted and guaranteed. Through simulation, it is possible to obtain a low-precision overview of the system's behavior. However, for low-probability events, the required number of samples in simulation increases rapidly, making it difficult to conduct a sufficient number of simulations in practical applications, and the statistical results may deviate from the actual outcomes. Therefore, a formal analysis is needed to evaluate the error rate of the system. This paper improves the worst-case probability response time analysis by using convolution-based busy-window and backlog techniques under the error retransmission protocol of CANs. Empirical analysis shows that the proposed method improves upon existing methods in terms of accuracy and efficiency.

Submitted 28 November, 2024; v1 submitted 6 November, 2024; originally announced November 2024.

arXiv:2411.05141 [pdf, ps, other]

Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation

Authors: Mu Yang, Bowen Shi, Matthew Le, Wei-Ning Hsu, Andros Tjandra

Abstract: This work focuses on improving Text-To-Audio (TTA) generation on zero-shot and few-shot settings (i.e. generating unseen or uncommon audio events). Inspired by the success of Retrieval-Augmented Generation (RAG) in Large Language Models, we propose Audiobox TTA-RAG, a novel retrieval-augmented TTA approach based on Audiobox, a flow-matching audio generation model. Unlike the vanilla Audiobox TTA solution that generates audio conditioned on text only, we extend the TTA process by augmenting the conditioning input with both text and retrieved audio samples. Our retrieval method does not require the external database to have labeled audio, offering more practical use cases. We show that the proposed model can effectively leverage the retrieved audio samples and significantly improve zero-shot and few-shot TTA performance, with large margins on multiple evaluation metrics, while maintaining the ability to generate semantically aligned audio for the in-domain setting.

Submitted 6 June, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

Comments: Interspeech 2025

arXiv:2410.21276 [pdf, other]

GPT-4o System Card

Authors: OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, et al. (395 additional authors not shown)

Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

Submitted 25 October, 2024; originally announced October 2024.

arXiv:2410.12177 [pdf, other]

Towards Large Scale Atomic Manufacturing: Heterodyne Grating Interferometer with Zero Dead-Zone

Authors: Can Cui, Lvye Gao, Pengbo Zhao, Menghan Yang, Lifu Liu, Yu Ma, Guangyao Huang, Shengtong Wang, Linbin Luo, Xinghui Li

Abstract: This paper presents a novel heterodyne grating interferometer designed to meet the precise measurement requirements of next-generation lithography systems and large-scale atomic-level manufacturing. Utilizing a dual-frequency light source, the interferometer enables simultaneous measurement of three degrees of freedom. Key advancements include a compact zero Dead-Zone optical path configuration, significantly enhancing measurement reliability by mitigating the impact of light source fluctuations and air refractive index variations. A comprehensive crosstalk error analysis was conducted, resulting in a robust correction algorithm that reduces errors to below 5%. Performance testing of the prototype, measuring 90 mm × 90 mm × 40 mm, demonstrated exceptional resolution (0.25 nm in the XY-axis and 0.3 nm in the Z-axis), superior linearity (6.9e-5, 8.1e-5 and 16.2e-5 for the X, Y, and Z axes, respectively), high repeatability (0.8 nm/1000 nm for the three axes) and stability (20 nm for the XY-axis and 60 nm for the Z-axis over 1000 seconds). Comparative analysis with existing measurement sensors highlights the proposed method's significant advantages in integration and multidimensional capabilities, and the method is expected to be widely used in fields such as integrated circuits, atomic-level manufacturing and aerospace technology.

Submitted 15 October, 2024; originally announced October 2024.

Comments: 8 pages, 11 figures

arXiv:2410.02957 [pdf, other]

Human Balancing on a Log: A Switched Multi-Layer Controller

Authors: Jiayi Zhao, Mo Yang, Jing Shuang Li

Abstract: We study the task of balancing a human on a log that is fixed in place. Balancing on a log is substantially more challenging than balancing on a flat surface due to increased instability -- nonetheless, we are able to balance by composing simple (e.g., PID, LQR) controllers in a bio-inspired switched multi-layer configuration. The controller consists of an upper-layer LQR planner (akin to the central nervous system) that coordinates ankle and hip torques, and lower-layer PID trackers (akin to local motor units) that follow this plan subject to nonlinear dynamics. The controller switches between three operational modes depending on the state of the human. The efficacy of the controller is verified in simulation, where our controller is able to stabilize the human for a variety of initial conditions and disturbances. We also introduce a controller that outputs muscle activations to perform the same balancing task.

Submitted 19 March, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

Comments: to appear at 2025 IEEE American Control Conference (ACC)

arXiv:2409.06722 [pdf, other]

doi 10.1109/CCWC.2018.8301750

Automated Quantification of White Blood Cells in Light Microscopic Images of Injured Skeletal Muscle

Authors: Yang Jiao, Hananeh Derakhshan, Barbara St. Pierre Schneider, Emma Regentova, Mei Yang

Abstract: White blood cells (WBCs) are the most diverse cell types observed in the healing process of injured skeletal muscles. In the course of healing, WBCs exhibit dynamic cellular response and undergo multiple protein expression changes. The progress of healing can be analyzed by quantifying the number of WBCs or the amount of specific proteins in light microscopic images obtained at different time points after injury. In this paper, we propose an automated quantification and analysis framework to analyze WBCs using light microscopic images of uninjured and injured muscles. The proposed framework is based on the Localized Iterative Otsu's threshold method with muscle edge detection and region of interest extraction. Compared with the threshold methods used in ImageJ, the LI Otsu's threshold method has high resistance to background area and achieves better accuracy. The CD68-positive cell results are presented to demonstrate the effectiveness of the proposed work.

Submitted 26 August, 2024; originally announced September 2024.

Comments: 2 tables, 7 figures, 8 pages
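
For readers unfamiliar with the thresholding step named in the abstract above, the following is a minimal sketch of the classic global Otsu threshold, which picks the gray level that maximizes between-class variance. It is not the paper's Localized Iterative variant, and it omits the muscle edge detection and region-of-interest extraction; the function name `otsu_threshold` and the flat list-of-integers input are assumptions made for illustration only.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class, the remaining pixels the other.
    `pixels` is a flat iterable of integers in [0, levels).
    """
    pixels = list(pixels)
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of intensities in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to the constant factor 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical bimodal sample: dark background and bright cells.
sample = [10] * 50 + [200] * 50
threshold = otsu_threshold(sample)
cell_pixels = sum(1 for p in sample if p > threshold)
```

A localized variant would presumably apply a routine like this per image region rather than once per image; that reading is an assumption, since the abstract does not spell out the procedure.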

arXiv:2408.09241 [pdf, other]

Re-boosting Self-Collaboration Parallel Prompt GAN for Unsupervised Image Restoration

Authors: Xin Lin, Yuyan Zhou, Jingtong Yue, Chao Ren, Kelvin C. K. Chan, Lu Qi, Ming-Hsuan Yang

Abstract: Unsupervised restoration approaches based on generative adversarial networks (GANs) offer a promising solution without requiring paired datasets. Yet, these GAN-based approaches struggle to surpass the performance of conventional unsupervised GAN-based frameworks without significantly modifying model structures or increasing the computational complexity. To address these issues, we propose a self-collaboration (SC) strategy for existing restoration models. This strategy utilizes information from the previous stage as feedback to guide subsequent stages, achieving significant performance improvement without increasing the framework's inference complexity. The SC strategy comprises a prompt learning (PL) module and a restorer ($Res$). It iteratively replaces the previous less powerful fixed restorer $\overline{Res}$ in the PL module with a more powerful $Res$. The enhanced PL module generates better pseudo-degraded/clean image pairs, leading to a more powerful $Res$ for the next iteration. Our SC can significantly improve the $Res$'s performance by over 1.5 dB without adding extra parameters or computational complexity during inference. Meanwhile, existing self-ensemble (SE) and our SC strategies enhance the performance of pre-trained restorers from different perspectives. As SE increases computational complexity during inference, we propose a re-boosting module to the SC (Reb-SC) to improve the SC strategy further by incorporating SE into SC without increasing inference time. This approach further enhances the restorer's performance by approximately 0.3 dB. Extensive experimental results on restoration tasks demonstrate that the proposed model performs favorably against existing state-of-the-art unsupervised restoration methods. Source code and trained models are publicly available at: https://github.com/linxin0/RSCP2GAN.

Submitted 17 August, 2024; originally announced August 2024.

Comments: This paper is an extended and revised version of our previous work "Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches" (https://openaccess.thecvf.com/content/ICCV2023/papers/Lin_Unsupervised_Image_Denoising_in_Real-World_Scenarios_via_Self-Collaboration_Parallel_Generative_ICCV_2023_paper.pdf)

arXiv:2407.04675 [pdf, other]

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li , et al. (30 additional authors not shown)

Abstract: Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

arXiv:2406.16871 [pdf, other]

Neural network based model predictive control of voltage for a polymer electrolyte fuel cell system with constraints

Authors: Xiufei Li, Miao Yang, Yuanxin Qi, Miao Zhang

Abstract: A fuel cell system must output a steady voltage as a power source in practical use. A neural network (NN) based model predictive control (MPC) approach is developed in this work to regulate the fuel cell output voltage with safety constraints. The developed NN MPC controller stabilizes the polymer electrolyte fuel cell system's output voltage by controlling the hydrogen and air flow rates at the same time. The safety constraints regarding the hydrogen pressure limit and input change rate limit are considered. The neural network model is built to describe the system voltage and hydrogen pressure behavior. Simulation results show that the NN MPC can control the voltage at the desired value while satisfying the safety constraints under workload disturbance. The NN MPC shows performance comparable to that of the MPC based on the detailed underlying system physical model.

Submitted 24 March, 2024; originally announced June 2024.

arXiv:2406.12596 [pdf, ps, other]

Beyond Near-Field: Far-Field Location Division Multiple Access in Downlink MIMO Systems

Authors: Haoyan Liu, Caijian Jie, Min Yang, Chengguang Li

Abstract: Exploring channel dimensions has been the driving force behind breakthroughs in successive generations of mobile communication systems. In 5G, space division multiple access (SDMA) leveraging massive MIMO has been crucial in enhancing system capacity through spatial differentiation of users. However, SDMA can only finely distinguish users at adjacent angles in ultra-dense networks by extremely large-scale antenna arrays. For a long time, most research has focused on the angle domain of the space, overlooking the potential of the distance domain. Near-field location division multiple access (LDMA) was proposed based on the beam-focusing effect yielded by the near-field spherical propagation model, partitioning channel resources by both angle and distance. To achieve a similar idea in the far-field region, this paper introduces a far-field LDMA scheme for wideband systems based on orthogonal frequency division multiplexing (OFDM). Benefiting from frequency diverse arrays (FDA), it becomes possible to manipulate beams in the distance domain. Combined with OFDM, the inherent cyclic prefix ensures a complete OFDM symbol can be received without losing distance information, while the matched filter of OFDM helps eliminate the time-variance of FDA steering vectors. Theoretical and simulation results show that LDMA can fully exploit the additional degrees of freedom in the distance domain to significantly improve spectral efficiency, especially in narrow sector multiple access (MA) scenarios. Moreover, LDMA can maintain independence between array elements even in single-path channels, making it stand out in MA schemes at millimeter-wave and higher frequency bands.

Submitted 30 January, 2025; v1 submitted 18 June, 2024; originally announced June 2024.

Comments: We have omitted an important detail of the baseband equivalent model, which may mislead the reader. We are currently trying to resolve this issue, please withdraw our submission

arXiv:2406.10137 [pdf, ps, other]

Compressed Sensor Caching and Collaborative Sparse Data Recovery with Anchor Alignment

Authors: Yi-Jen Yang, Ming-Hsun Yang, Jwo-Yuh Wu, Y. -W. Peter Hong

Abstract: This work examines the compressed sensor caching problem in wireless sensor networks and devises efficient distributed sparse data recovery algorithms to enable collaboration among multiple caches. In this problem, each cache is only allowed to access measurements from a small subset of sensors within its vicinity to reduce both cache size and data acquisition overhead. To enable reliable data recovery with limited access to measurements, we propose a distributed sparse data recovery method, called the collaborative sparse recovery by anchor alignment (CoSR-AA) algorithm, where collaboration among caches is enabled by aligning their locally recovered data at a few anchor nodes. The proposed algorithm is based on the consensus alternating direction method of multipliers (ADMM) algorithm but with message exchange that is reduced by considering the proposed anchor alignment strategy. Then, by the deep unfolding of the ADMM iterations, we further propose the Deep CoSR-AA algorithm that can be used to significantly reduce the number of iterations. We obtain a graph neural network architecture where message exchange is done more efficiently by an embedded autoencoder. Simulations are provided to demonstrate the effectiveness of the proposed collaborative recovery algorithms in terms of the improved reconstruction quality and the reduced communication overhead due to anchor alignment.

Submitted 14 June, 2024; originally announced June 2024.

Comments: v1 was submitted to IEEE Transactions on Signal Processing on Sept. 18, 2023

arXiv:2405.10589 [pdf, other]

Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance

Authors: I-Hsiang Chen, Wei-Ting Chen, Yu-Wei Liu, Ming-Hsuan Yang, Sy-Yen Kuo

Abstract: Crowd counting and localization have become increasingly important in computer vision due to their wide-ranging applications. While point-based strategies have been widely used in crowd counting methods, they face a significant challenge, i.e., the lack of an effective learning strategy to guide the matching process. This deficiency leads to instability in matching point proposals to target points, adversely affecting overall performance. To address this issue, we introduce an effective approach to stabilize the proposal-target matching in point-based methods. We propose Auxiliary Point Guidance (APG) to provide clear and effective guidance for proposal selection and optimization, addressing the core issue of matching uncertainty. Additionally, we develop Implicit Feature Interpolation (IFI) to enable adaptive feature extraction in diverse crowd scenarios, further enhancing the model's robustness and accuracy. Extensive experiments demonstrate the effectiveness of our approach, showing significant improvements in crowd counting and localization performance, particularly under challenging conditions. The source codes and trained models will be made publicly available.

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.07442 [pdf]

Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases

Authors: Pengfei Zhang, Zhihang Zheng, Shichen Zhang, Minghao Yang, Shaojun Tang

Abstract: Compared with invasive examinations that require tissue sampling, respiratory sound testing is a non-invasive examination method that is safer and easier for patients to accept. In this study, we introduce Rene, a pioneering large-scale model tailored for respiratory sound recognition. Rene has been rigorously fine-tuned with an extensive dataset featuring a broad array of respiratory audio samples, targeting disease detection, sound pattern classification, and event identification. Our innovative approach applies a pre-trained speech recognition model to process respiratory sounds, augmented with patient medical records. The resulting multi-modal deep-learning framework addresses interpretability and real-time diagnostic challenges that have hindered previous respiratory-focused models. Benchmark comparisons reveal that Rene significantly outperforms existing models, achieving improvements of 10.27%, 16.15%, 15.29%, and 18.90% in respiratory event detection and audio classification on the SPRSound database. Disease prediction accuracy on the ICBHI database improved by 23% over the baseline in both mean average and harmonic scores. Moreover, we have developed a real-time respiratory sound discrimination system utilizing the Rene architecture. Employing state-of-the-art Edge AI technology, this system enables rapid and accurate responses for respiratory sound auscultation (https://github.com/zpforlove/Rene).

Submitted 6 June, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.01200 [pdf, other]

Learning-to-solve unit commitment based on few-shot physics-guided spatial-temporal graph convolution network

Authors: Mei Yang, Gao Qiu, Junyong Liu, Kai Liu

Abstract: This letter proposes a few-shot physics-guided spatial-temporal graph convolutional network (FPG-STGCN) to solve unit commitment (UC) quickly. First, STGCN is tailored to parameterize UC. Then, a few-shot physics-guided learning scheme is proposed. It exploits a few typical UC solutions produced by a commercial optimizer to escape local minima, and leverages the augmented Lagrangian method for constraint satisfaction. To further enable both feasibility and continuous relaxation for integers in the learning process, a straight-through estimator for the Tanh-Sign composition is proposed to fully differentiate the mixed-integer solution space. A case study on the IEEE benchmark shows that our method outperforms mainstream learning approaches on UC feasibility and surpasses a traditional solver in efficiency.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2404.17736 [pdf, other]

Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission

Authors: Mingyu Yang, Bowen Liu, Boyang Wang, Hun-Seok Kim

Abstract: Deep learning-based joint source-channel coding (deep JSCC) has been demonstrated to be an effective approach for wireless image transmission. Nevertheless, most existing work adopts an autoencoder framework to optimize conventional criteria such as Mean Squared Error (MSE) and Structural Similarity Index (SSIM) which do not suffice to maintain the perceptual quality of reconstructed images. Such an issue is more prominent under stringent bandwidth constraints or low signal-to-noise ratio (SNR) conditions. To tackle this challenge, we propose DiffJSCC, a novel framework that leverages the prior knowledge of the pre-trained Stable Diffusion model to produce high-realism images via the conditional diffusion denoising process. Our DiffJSCC first extracts multimodal spatial and textual features from the noisy channel symbols in the generation phase. Then, it produces an initial reconstructed image as an intermediate representation to aid robust feature extraction and a stable training process. In the following diffusion step, DiffJSCC uses the derived multimodal features, together with channel state information such as the signal-to-noise ratio (SNR), as conditions to guide the denoising diffusion process, which converts the initial random noise to the final reconstruction. DiffJSCC employs a novel control module to fine-tune the Stable Diffusion model and adjust it to the multimodal conditions. Extensive experiments on diverse datasets reveal that our method significantly surpasses prior deep JSCC approaches on both perceptual metrics and downstream task performance, showcasing its ability to preserve the semantics of the original transmitted images. Notably, DiffJSCC can achieve highly realistic reconstructions for 768x512 pixel Kodak images with only 3072 symbols (<0.008 symbols per pixel) under 1dB SNR channels.
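The bandwidth figure quoted in the abstract above can be checked with a short calculation. The sketch below is only an arithmetic illustration and is not part of the DiffJSCC code; the variable names are hypothetical, and the only inputs are the image size (768x512) and symbol count (3072) stated in the abstract.

```python
# Arithmetic check of the "<0.008 symbols per pixel" figure.
# All names here are illustrative; the numbers come from the abstract.
width, height = 768, 512      # Kodak image size in pixels
channel_symbols = 3072        # number of transmitted channel symbols

pixels = width * height                        # total pixel count
symbols_per_pixel = channel_symbols / pixels   # bandwidth ratio

print(f"{pixels} pixels, {symbols_per_pixel:.7f} symbols per pixel")
```

The ratio works out to 3072 / 393216 = 1/128, which is about 0.0078 and therefore consistent with the stated bound of 0.008 symbols per pixel.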

Submitted 21 March, 2025; v1 submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.13153 [pdf, other]

Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring

Authors: Chengxu Liu, Xuan Wang, Xiangyu Xu, Ruhao Tian, Shuai Li, Xueming Qian, Ming-Hsuan Yang

Abstract: Eliminating image blur produced by various kinds of motion has been a challenging problem. Dominant approaches rely heavily on model capacity to remove blurring by reconstructing residual from blurry observation in feature space. These practices not only prevent the capture of spatially variable motion in the real world but also ignore the tailored handling of various motions in image space. In this paper, we propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative (MISC) Filter. In particular, we use a motion estimation network to capture motion information from neighborhoods, thereby adaptively estimating spatially-variant motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter. The MISC Filter first aligns the motion-induced blurring patterns to the motion middle along the predicted flow direction, and then collaboratively filters the aligned image through the predicted kernels, weights, and offsets to generate the output. This design can handle more generalized and complex motion in a spatially differentiated manner. Furthermore, we analyze the relationships between the motion estimation network and the residual reconstruction network. Extensive experiments on four widely used benchmarks demonstrate that our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance. Code is available at https://github.com/ChengxuLiu/MISCFilter

Submitted 19 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2404.11836 [pdf, other]

AI-Empowered RIS-Assisted Networks: CV-Enabled RIS Selection and DNN-Enabled Transmission

Authors: Conggang Hu, Yang Lu, Hongyang Du, Mi Yang, Bo Ai, Dusit Niyato

Abstract: This paper investigates artificial intelligence (AI) empowered schemes for reconfigurable intelligent surface (RIS) assisted networks from the perspective of fast implementation. We formulate a weighted sum-rate maximization problem for a multi-RIS-assisted network. To avoid the huge channel estimation overhead of activating all RISs, we propose a computer vision (CV) enabled RIS selection scheme based on a single shot multi-box detector. To realize real-time resource allocation, a deep neural network (DNN) enabled transmit design is developed to learn the optimal mapping from channel information to transmit beamformers and phase shift matrix. Numerical results illustrate that the CV module is able to select the RIS with the best propagation condition. The well-trained DNN achieves similar sum-rate performance to the existing alternative optimization method but with much smaller inference time.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.11313 [pdf, other]

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Authors: Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai, Jianhui Sun, Tianyi Wang, Lei Li, Han Kong, Wenxuan Wang, Bing Li, Cheng Luo , et al. (43 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from a popular short-form video platform, i.e., the Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.

Submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

arXiv:2404.06265 [pdf, other]

Spatial-Temporal Multi-level Association for Video Object Segmentation

Authors: Deshui Miao, Xin Li, Zhenyu He, Huchuan Lu, Ming-Hsuan Yang

Abstract: Existing semi-supervised video object segmentation methods either focus on temporal feature matching or spatial-temporal feature modeling. However, they do not address the issues of sufficient target interaction and efficient parallel processing simultaneously, thereby constraining the learning of dynamic, target-aware features. To tackle these limitations, this paper proposes a spatial-temporal multi-level association framework, which jointly associates reference frame, test frame, and object features to achieve sufficient interaction and parallel target ID association with a spatial-temporal memory bank for efficient video object segmentation. Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features, which formulates feature extraction and interaction as the efficient operations of object self-attention, reference object enhancement, and test reference correlation. In addition, we propose a spatial-temporal memory to assist feature association and temporal ID assignment and correlation. We evaluate the proposed method by conducting extensive experiments on numerous video object segmentation datasets, including DAVIS 2016/2017 val, DAVIS 2017 test-dev, and YouTube-VOS 2018/2019 val. The favorable performance against the state-of-the-art methods demonstrates the effectiveness of our approach. All source code and trained models will be made publicly available.

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2403.16170 [pdf, other]

Voltage Regulation in Polymer Electrolyte Fuel Cell Systems Using Gaussian Process Model Predictive Control

Authors: Xiufei Li, Miao Zhang, Yuanxin Qi, Miao Yang

Abstract: This study introduces a novel approach utilizing Gaussian process model predictive control (MPC) to stabilize the output voltage of a polymer electrolyte fuel cell (PEFC) system by simultaneously regulating hydrogen and airflow rates. Two Gaussian process models are developed to capture PEFC dynamics, taking into account constraints including hydrogen pressure and input change rates, thereby aiding in mitigating errors inherent to PEFC predictive control. The dynamic performance of the physical model and Gaussian process MPC in constraint handling and system inputs is compared and analyzed. Simulation outcomes demonstrate that the proposed Gaussian process MPC effectively maintains the voltage at the target 48 V while adhering to safety constraints, even amidst workload disturbances ranging from 110-120 A. In comparison to traditional MPC using detailed system models, Gaussian process MPC exhibits a 43% higher overshoot and 25% slower response time. Nonetheless, it offers the advantage of not requiring the underlying true system model and needing less system information.

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2403.00605 [pdf, other]

Channel Measurements and Modeling for Dynamic Vehicular ISAC Scenarios at 28 GHz

Authors: Zhengyu Zhang, Ruisi He, Bo Ai, Mi Yang, Xuejian Zhang, Ziyi Qi, Yuan Yuan

Abstract: Integrated sensing and communication (ISAC) is a promising technology for 6G, with the goal of providing end-to-end information processing and inherent perception capabilities for future communication systems. Within ISAC emerging application scenarios, vehicular ISAC technologies have the potential to enhance traffic efficiency and safety through integration of communication and synchronized perception abilities. To establish a foundational theoretical support for vehicular ISAC system design and standardization, it is necessary to conduct channel measurements and modeling to obtain a deep understanding of the radio propagation. In this paper, a dynamic statistical channel model is proposed for vehicular ISAC scenarios, incorporating Sensing Multipath Components (S-MPCs) and Clutter Multipath Components (C-MPCs), which are identified by the proposed tracking algorithm. Based on actual vehicular ISAC channel measurements at 28 GHz, time-varying sensing characteristics in front, left, and right directions are investigated. To model the dynamic evolution process of the channel, the number of new S-MPCs, lifetimes, initial power and delay positions, dynamic variations within their lifetimes, clustering, power decay, and fading of C-MPCs are statistically characterized. Finally, the paper provides an implementation of the dynamic vehicular ISAC model and validates it by comparing key statistics between measurements and simulations.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2403.00569 [pdf, other]

Characterization of Wireless Channel Semantics: A New Paradigm

Authors: Zhengyu Zhang, Ruisi He, Mi Yang, Xuejian Zhang, Ziyi Qi, Yuan Yuan, Bo Ai

Abstract: Recently, deep learning enabled semantic communications have been developed to understand transmission content at the semantic level, realizing effective and accurate information transfer. Toward the vision of sixth generation (6G) networks, wireless devices are expected to have native perception and intelligent capabilities, which associate the wireless channel with surrounding environments from the physical propagation dimension to the semantic information dimension. Inspired by these, we aim to provide a new paradigm for the wireless channel at the semantic level. A channel semantic model and its characterization framework are proposed in this paper. Specifically, a channel semantic model is composed of status semantics, behavior semantics and event semantics. Based on actual channel measurement at 28 GHz, as well as multi-mode data, example results of channel semantic characterization are provided and analyzed, which exhibit reasonable and interpretable semantic information.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2403.00557 [pdf, other]

Non-stationarity Characteristics in Dynamic Vehicular ISAC Channels at 28 GHz

Authors: Zhengyu Zhang, Ruisi He, Mi Yang, Xuejian Zhang, Ziyi Qi, Hang Mi, Guiqi Sun, Jingya Yang, Bo Ai

Abstract: Integrated sensing and communications (ISAC) is a potential technology of 6G, aiming to enable end-to-end information processing ability and native perception capability for future communication systems. As an important part of the ISAC application scenarios, ISAC aided vehicle-to-everything (V2X) can improve the traffic efficiency and safety through intercommunication and synchronous perception. It is necessary to carry out measurement, characterization, and modeling for vehicular ISAC channels as the basic theoretical support for system design. In this paper, dynamic vehicular ISAC channel measurements at 28 GHz are carried out and provide data for the characterization of non-stationarity characteristics. Based on the actual measurements, this paper analyzes the time-varying PDPs, RMSDS and non-stationarity characteristics of front, lower front, left and right perception directions in complicated V2X scenarios. The research in this paper can enrich the investigation of vehicular ISAC channels and enable the analysis and design of vehicular ISAC systems.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2403.00505 [pdf, other]

A Cluster-Based Statistical Channel Model for Integrated Sensing and Communication Channels

Authors: Zhengyu Zhang, Ruisi He, Bo Ai, Mi Yang, Yong Niu, Zhangdui Zhong, Yujian Li, Xuejian Zhang, Jing Li

Abstract: The emerging 6G network envisions integrated sensing and communication (ISAC) as a promising solution to meet growing demand for native perception ability. To optimize and evaluate ISAC systems and techniques, it is crucial to have an accurate and realistic wireless channel model. However, some important features of ISAC channels have not been well characterized. For example, most existing ISAC channel models consider communication channels and sensing channels independently, while ignoring their correlation under the same environment. Moreover, sensing channels have not been well modeled in the existing standard-level channel models. Therefore, in order to better model the ISAC channel, a cluster-based statistical channel model is proposed in this paper, which is based on measurements conducted at 28 GHz. In the proposed model, a new framework based on the 3GPP standard is proposed, which includes communication clusters and sensing clusters. Clustering and tracking algorithms are used to extract and analyze ISAC channel characteristics. Furthermore, some special sensing cluster structures such as shared sensing cluster, newborn sensing cluster, etc., are defined to model correlation and difference between communication and sensing channels. Finally, accuracy of the proposed model is validated based on measurements and simulations.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2402.10427 [pdf, other]

Evaluating and Improving Continual Learning in Spoken Language Understanding

Authors: Muqiao Yang, Xiang Li, Umberto Cappellazzo, Shinji Watanabe, Bhiksha Raj

Abstract: Continual learning has emerged as an increasingly important challenge across various tasks, including Spoken Language Understanding (SLU). In SLU, its objective is to effectively handle the emergence of new concepts and evolving environments. The evaluation of continual learning algorithms typically involves assessing the model's stability, plasticity, and generalizability as fundamental aspects of standards. However, existing continual learning metrics primarily focus on only one or two of the properties. They neglect the overall performance across all tasks, and do not adequately disentangle the plasticity versus stability/generalizability trade-offs within the model. In this work, we propose an evaluation methodology that provides a unified evaluation on stability, plasticity, and generalizability in continual learning. By employing the proposed metric, we demonstrate how introducing various knowledge distillations can improve different aspects of these three properties of the SLU model. We further show that our proposed metric is more sensitive in capturing the impact of task ordering in continual learning, making it better suited for practical use-case scenarios.

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2401.02046 [pdf, other]

CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-based Speech Recognition

Authors: Junfeng Hou, Peiyao Wang, Jincheng Zhang, Meng Yang, Minwei Feng, Jingcheng Yin

Abstract: Deploying end-to-end speech recognition models with limited computing resources remains challenging, despite their impressive performance. Given the gradual increase in model size and the wide range of model applications, selectively executing model components for different inputs to improve the inference efficiency is of great interest. In this paper, we propose a dynamic layer-skipping method that leverages the CTC blank output from intermediate layers to trigger the skipping of the last few encoder layers for frames with high blank probabilities. Furthermore, we factorize the CTC output distribution and perform knowledge distillation on intermediate layers to reduce computation and improve recognition accuracy. Experimental results show that by utilizing the CTC blank, the encoder layer depth can be adjusted dynamically, resulting in 29% acceleration of the CTC model inference with minor performance degradation.

Submitted 3 January, 2024; originally announced January 2024.

Comments: accepted by ASRU 2023

Showing 1–50 of 145 results for author: Yang, M