-
PLUS: Plug-and-Play Enhanced Liver Lesion Diagnosis Model on Non-Contrast CT Scans
Authors:
Jiacheng Hao,
Xiaoming Zhang,
Wei Liu,
Xiaoli Yin,
Yuan Gao,
Chunli Li,
Ling Zhang,
Le Lu,
Yu Shi,
Xu Han,
Ke Yan
Abstract:
Focal liver lesions (FLL) are common clinical findings during physical examination. Early diagnosis and intervention of liver malignancies are crucial to improving patient survival. Although the current 3D segmentation paradigm can accurately detect lesions, it faces limitations in distinguishing between malignant and benign liver lesions, primarily due to its inability to differentiate subtle variations between different lesions. Furthermore, existing methods predominantly rely on specialized imaging modalities such as multi-phase contrast-enhanced CT and magnetic resonance imaging, whereas non-contrast CT (NCCT) is more prevalent in routine abdominal imaging. To address these limitations, we propose PLUS, a plug-and-play framework that enhances FLL analysis on NCCT images for arbitrary 3D segmentation models. In extensive experiments involving 8,651 patients, PLUS demonstrated a significant improvement over existing methods, improving the lesion-level F1 score by 5.66%, the malignant patient-level F1 score by 6.26%, and the benign patient-level F1 score by 4.03%. Our results demonstrate the potential of PLUS to substantially improve malignant FLL screening using widely available NCCT imaging.
Submitted 4 July, 2025;
originally announced July 2025.
-
FundaQ-8: A Clinically-Inspired Scoring Framework for Automated Fundus Image Quality Assessment
Authors:
Lee Qi Zun,
Oscar Wong Jin Hao,
Nor Anita Binti Che Omar,
Zalifa Zakiah Binti Asnir,
Mohamad Sabri bin Sinal Zainal,
Goh Man Fye
Abstract:
Automated fundus image quality assessment (FIQA) remains a challenge due to variations in image acquisition and subjective expert evaluations. We introduce FundaQ-8, a novel expert-validated framework for systematically assessing fundus image quality using eight critical parameters, including field coverage, anatomical visibility, illumination, and image artifacts. Using FundaQ-8 as a structured scoring reference, we develop a ResNet18-based regression model to predict continuous quality scores in the 0 to 1 range. The model is trained on 1800 fundus images from real-world clinical sources and Kaggle datasets, using transfer learning, mean squared error optimization, and standardized preprocessing. Validation against the EyeQ dataset and statistical analyses confirm the framework's reliability and clinical interpretability. Incorporating FundaQ-8 into deep learning models for diabetic retinopathy grading also improves diagnostic robustness, highlighting the value of quality-aware training in real-world screening applications.
Submitted 25 June, 2025;
originally announced June 2025.
-
Short Wins Long: Short Codes with Language Model Semantic Correction Outperform Long Codes
Authors:
Jiafu Hao,
Chentao Yue,
Hao Chang,
Branka Vucetic,
Yonghui Li
Abstract:
This paper presents a novel semantic-enhanced decoding scheme for transmitting natural language sentences with multiple short block codes over noisy wireless channels. After ASCII source coding, the natural language sentence message is divided into segments, where each is encoded with short block channel codes independently before transmission. At the receiver, each short block of codewords is decoded in parallel, followed by a semantic error correction (SEC) model to reconstruct corrupted segments semantically. We design and train the SEC model based on Bidirectional and Auto-Regressive Transformers (BART). Simulations demonstrate that the proposed scheme can significantly outperform encoding the sentence with one conventional long LDPC code, in terms of block error rate (BLER), semantic metrics, and decoding latency. Finally, we propose a semantic hybrid automatic repeat request (HARQ) scheme to further enhance the error performance, which selectively requests retransmission depending on semantic uncertainty.
Submitted 13 May, 2025;
originally announced May 2025.
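The transmitter-side preprocessing described in the abstract above (ASCII source coding followed by splitting the message into independently coded segments) can be sketched as follows. This is only an illustration under stated assumptions: 8-bit ASCII with the most significant bit first, zero-padding of the final segment, and the segment length `k` are all hypothetical choices, and the channel code and the SEC model are not shown.

```python
def sentence_to_bits(sentence: str) -> list[int]:
    """ASCII source coding: 8 bits per character, MSB first (assumed layout)."""
    data = sentence.encode("ascii")
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def split_into_segments(bits: list[int], k: int) -> list[list[int]]:
    """Split the bit stream into k-bit segments, zero-padding the last one.

    Each segment would then be passed to its own short block encoder.
    """
    padded = bits + [0] * (-len(bits) % k)
    return [padded[i:i + k] for i in range(0, len(padded), k)]


def segments_to_sentence(segments: list[list[int]], n_chars: int) -> str:
    """Inverse mapping at the receiver, assuming error-free segments."""
    bits = [b for seg in segments for b in seg][: 8 * n_chars]
    chars = []
    for i in range(0, len(bits), 8):
        value = 0
        for b in bits[i:i + 8]:
            value = (value << 1) | b
        chars.append(chr(value))
    return "".join(chars)


if __name__ == "__main__":
    message = "short wins long"
    segments = split_into_segments(sentence_to_bits(message), k=64)
    print(len(segments), [len(s) for s in segments])
    print(segments_to_sentence(segments, len(message)))
```

Because the segments are independent, they can be decoded in parallel, which is where the latency advantage mentioned in the abstract would come from.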
-
Skull stripping with purely synthetic data
Authors:
Jong Sung Park,
Juhyung Ha,
Siddhesh Thakur,
Alexandra Badea,
Spyridon Bakas,
Eleftherios Garyfallidis
Abstract:
While many skull stripping algorithms have been developed for multi-modal and multi-species cases, there is still a lack of a fundamentally generalizable approach. We present PUMBA (PUrely synthetic Multimodal/species invariant Brain extrAction), a strategy to train a model for brain extraction with no real brain images or labels. Our results show that even without any real images or anatomical priors, the model achieves comparable accuracy in multi-modal, multi-species and pathological cases. This work presents a new direction of research for any generalizable medical image segmentation task.
Submitted 11 May, 2025;
originally announced May 2025.
-
ViMo: A Generative Visual GUI World Model for App Agents
Authors:
Dezhao Luo,
Bohan Tang,
Kang Li,
Georgios Papoudakis,
Jifei Song,
Shaogang Gong,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal actions for complex tasks with longer steps. To address this, world models are used to predict the next GUI observation based on user actions, enabling more effective agent planning. However, existing world models primarily focus on generating only textual descriptions, lacking essential visual details. To fill this gap, we propose ViMo, the first visual world model designed to generate future App observations as images. For the challenge of generating text in image patches, where even minor pixel errors can distort readability, we decompose GUI generation into graphic and text content generation. We propose a novel data representation, the Symbolic Text Representation (STR), to overlay text content with symbolic placeholders while preserving graphics. With this design, ViMo employs an STR Predictor to predict future GUIs' graphics and a GUI-text Predictor for generating the corresponding text. Moreover, we deploy ViMo to enhance agent-focused tasks by predicting the outcome of different action options. Experiments show ViMo's ability to generate visually plausible and functionally effective GUIs that enable App agents to make more informed decisions.
Submitted 20 May, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement
Authors:
Hantang Li,
Jinhua Hao,
Lei Xiong,
Shuyuan Zhu
Abstract:
In practical applications, conventional methods generate large volumes of low-light images that require compression for efficient storage and transmission. However, most existing methods either disregard the removal of potential compression artifacts during the enhancement process or fail to establish a unified framework for joint task enhancement of images with varying compression qualities. To solve this problem, we propose the hybrid priors-guided network (HPGN), which enhances compressed low-light images by integrating both compression and illumination priors. Our approach fully utilizes the JPEG quality factor (QF) and DCT quantization matrix (QM) to guide the design of efficient joint task plug-and-play modules. Additionally, we employ a random QF generation strategy to guide model training, enabling a single model to enhance images across different compression levels. Experimental results confirm the superiority of our proposed method.
Submitted 3 April, 2025;
originally announced April 2025.
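As a rough illustration of how a JPEG quality factor (QF) relates to a DCT quantization matrix (QM), and of what a random QF generation strategy might look like during training, here is a minimal sketch. It assumes the widely used IJG/libjpeg quality-scaling convention and the example luminance table from Annex K of the JPEG standard; the sampling range in `random_quality_factor` is hypothetical, and none of this is taken from the HPGN implementation itself.

```python
import random

# Example luminance quantization table (JPEG standard, Annex K), row-major.
BASE_LUMA_QM = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quality_to_scale(qf: int) -> int:
    """Map a quality factor in [1, 100] to a percentage scale (IJG convention)."""
    qf = max(1, min(100, qf))
    return 5000 // qf if qf < 50 else 200 - 2 * qf


def quantization_matrix(qf: int, base=BASE_LUMA_QM) -> list[list[int]]:
    """Scale the base table for the given QF, clamping entries to [1, 255]."""
    scale = quality_to_scale(qf)
    return [[max(1, min(255, (v * scale + 50) // 100)) for v in row] for row in base]


def random_quality_factor(rng: random.Random, low: int = 10, high: int = 90) -> int:
    """Draw a QF uniformly from a hypothetical training range [low, high]."""
    return rng.randint(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    qf = random_quality_factor(rng)
    print(qf, quantization_matrix(qf)[0])
```

Under this convention a lower QF produces larger quantization steps, so the QM carries more detailed information about the compression strength than the scalar QF alone.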
-
Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization
Authors:
Weifei Jin,
Junjie Su,
Hejia Wang,
Yulin Ye,
Jie Hao
Abstract:
With the widespread application of automatic speech recognition (ASR) systems, their vulnerability to adversarial attacks has been extensively studied. However, most existing adversarial examples are generated on specific individual models, resulting in a lack of transferability. In real-world scenarios, attackers often cannot access detailed information about the target model, making query-based attacks unfeasible. To address this challenge, we propose a technique called Acoustic Representation Optimization that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models. Rather than relying on model-specific, higher-layer abstractions, our approach leverages fundamental acoustic representations that remain consistent across diverse ASR architectures. By enforcing an acoustic representation loss to guide perturbations toward these robust, lower-level representations, we enhance the cross-model transferability of adversarial examples without degrading audio quality. Our method is plug-and-play and can be integrated with any existing attack methods. We evaluate our approach on three modern ASR models, and the experimental results demonstrate that our method significantly improves the transferability of adversarial examples generated by previous methods while preserving the audio quality.
Submitted 25 March, 2025;
originally announced March 2025.
-
Dual-domain Multi-path Self-supervised Diffusion Model for Accelerated MRI Reconstruction
Authors:
Yuxuan Zhang,
Jinkui Hao,
Bo Zhou
Abstract:
Magnetic resonance imaging (MRI) is a vital diagnostic tool, but its inherently long acquisition times reduce clinical efficiency and patient comfort. Recent advancements in deep learning, particularly diffusion models, have improved accelerated MRI reconstruction. However, existing diffusion models often rely on fully sampled data for training, incur high computational costs, and lack uncertainty estimation, limiting their clinical applicability. To overcome these challenges, we propose a novel framework, called Dual-domain Multi-path Self-supervised Diffusion Model (DMSM), that integrates a self-supervised dual-domain diffusion model training scheme, a lightweight hybrid attention network for the reconstruction diffusion model, and a multi-path inference strategy, to enhance reconstruction accuracy, efficiency, and explainability. Unlike traditional diffusion-based models, DMSM eliminates the dependency on training from fully sampled data, making it more practical for real-world clinical settings. We evaluated DMSM on two human MRI datasets, demonstrating that it achieves favorable performance over several supervised and self-supervised baselines, particularly in preserving fine anatomical structures and suppressing artifacts under high acceleration factors. Additionally, our model generates uncertainty maps that correlate reasonably well with reconstruction errors, offering valuable clinically interpretable guidance and potentially enhancing diagnostic confidence.
Submitted 24 March, 2025;
originally announced March 2025.
-
Towards Patient-Specific Surgical Planning for Bicuspid Aortic Valve Repair: Fully Automated Segmentation of the Aortic Valve in 4D CT
Authors:
Zaiyang Guo,
Ningjun J Dong,
Harold Litt,
Natalie Yushkevich,
Melanie Freas,
Jessica Nunez,
Victor Ferrari,
Jilei Hao,
Shir Goldfinger,
Matthew A. Jolley,
Joseph Bavaria,
Nimesh Desai,
Alison M. Pouch
Abstract:
The bicuspid aortic valve (BAV) is the most prevalent congenital heart defect and may require surgery for complications such as stenosis, regurgitation, and aortopathy. BAV repair surgery is effective but challenging due to the heterogeneity of BAV morphology. Multiple imaging modalities can be employed to assist the quantitative assessment of BAVs for surgical planning. Contrast-enhanced 4D computed tomography (CT) produces volumetric temporal sequences with excellent contrast and spatial resolution. Segmentation of the aortic cusps and root in these images is an essential step in creating patient-specific models for visualization and quantification. While deep learning-based methods are capable of fully automated segmentation, no BAV-specific model exists. Among valve segmentation studies, there has been limited quantitative assessment of the clinical usability of the segmentation results. In this work, we developed a fully automated multi-label BAV segmentation pipeline based on nnU-Net. The predicted segmentations were used to carry out surgically relevant morphological measurements including geometric cusp height, commissural angle and annulus diameter, and the results were compared against manual segmentation. Automated segmentation achieved average Dice scores of over 0.7 and symmetric mean distance below 0.7 mm for all three aortic cusps and the root wall. Clinically relevant benchmarks showed good consistency between manual and predicted segmentations. Overall, fully automated BAV segmentation of 3D frames in 4D CT can produce clinically usable measurements for surgical risk stratification, but the temporal consistency of segmentations needs to be improved.
Submitted 13 February, 2025;
originally announced February 2025.
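The per-structure Dice scores reported in the abstract above follow the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), computed separately for each label of a multi-label segmentation. A minimal sketch on flattened label maps is given below; the integer label values and the convention of returning 1.0 when a label is absent from both masks are assumptions made for illustration, not details from the paper.

```python
def dice_score(pred: list[int], truth: list[int], label: int) -> float:
    """Dice overlap for one label between two flattened label maps."""
    if len(pred) != len(truth):
        raise ValueError("label maps must have the same number of voxels")
    pred_hits = [p == label for p in pred]
    truth_hits = [t == label for t in truth]
    intersection = sum(p and t for p, t in zip(pred_hits, truth_hits))
    total = sum(pred_hits) + sum(truth_hits)
    # Assumed convention: a label missing from both maps counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def per_label_dice(pred: list[int], truth: list[int], labels: list[int]) -> dict[int, float]:
    """Dice score for each requested label (0 is treated as background here)."""
    return {label: dice_score(pred, truth, label) for label in labels}


if __name__ == "__main__":
    # Hypothetical toy label maps: 1-3 could stand for cusps, 0 for background.
    predicted = [0, 1, 1, 2, 2, 3]
    manual = [0, 1, 2, 2, 2, 3]
    print(per_label_dice(predicted, manual, labels=[1, 2, 3]))
```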
-
Spatiotemporal Trajectory Tracking Method for Vehicles Incorporating Lead-Lag Judgement
Authors:
Yuan Li,
Xiang Dong,
Tao Li,
Junfeng Hao,
Xiaoxue Xu,
Sana Ullaha,
Yincai Cai,
Peng Wu,
Ting Peng
Abstract:
In the domain of intelligent transportation systems, especially within the context of autonomous vehicle control, the preemptive holistic collaborative system has been presented as a promising solution that can markedly improve traffic efficiency and substantially reduce the accident rate, demonstrating great development potential. For this system to operate as intended, accurate tracking of the spatiotemporal trajectory is crucial, and minimizing the tracking error is a necessary step in this process. To this end, a novel lead-lag judgment mechanism is proposed. This mechanism precisely quantifies the longitudinal positional deviation between the vehicle and the target trajectory over time; the deviation is then corrected with a real-time acceleration compensation strategy. As a result, the accuracy and reliability of trajectory tracking are significantly enhanced. Real-vehicle experiments were conducted in a dedicated test field to validate the feasibility of this approach empirically. The obtained tracking data were then processed using the lead-lag judgment mechanism. In this step, we analyzed the spatiotemporal error patterns between the vehicle and the target trajectory under different alignments and speeds. Finally, using real highway speed and alignment data, we conducted comprehensive spatiotemporal trajectory tracking simulations. Across the experiments and simulations, tracking errors remained within an acceptable range, and a reasonable spatiotemporal distance is given for the preemptive merging process on highway ramps. Overall, this study offers valuable insights for highway ramp merging safety. Future work can expand on these findings.
Submitted 6 February, 2025;
originally announced February 2025.
-
FECT: Classification of Breast Cancer Pathological Images Based on Fusion Features
Authors:
Jiacheng Hao,
Yiqing Liu,
Siqi Zeng,
Yonghong He
Abstract:
Breast cancer is one of the most common cancers among women globally, with early diagnosis and precise classification being crucial. With the advancement of deep learning and computer vision, the automatic classification of breast tissue pathological images has emerged as a research focus. Existing methods typically rely on singular cell or tissue features and lack design considerations for morphological characteristics of challenging-to-classify categories, resulting in suboptimal classification performance. To address these problems, we propose a novel breast cancer tissue classification model that Fuses features of Edges, Cells, and Tissues (FECT), employing the ResMTUNet and an attention-based aggregator to extract and aggregate these features. Extensive testing on the BRACS dataset demonstrates that our model surpasses current advanced methods in terms of classification accuracy and F1 scores. Moreover, due to its feature fusion that aligns with the diagnostic approach of pathologists, our model exhibits interpretability and holds promise for significant roles in future clinical applications.
Submitted 17 January, 2025;
originally announced January 2025.
-
Processing and Analyzing Real-World Driving Data: Insights on Trips, Scenarios, and Human Driving Behaviors
Authors:
Jihun Han,
Dominik Karbowski,
Ayman Moawad,
Namdoo Kim,
Aymeric Rousseau,
Shihong Fan,
Jason Hoon Lee,
Jinho Ha
Abstract:
Analyzing large volumes of real-world driving data is essential for providing meaningful and reliable insights into real-world trips, scenarios, and human driving behaviors. To this end, we developed a multi-level data processing approach that adds new information, segments data, and extracts desired parameters. Leveraging a confidential but extensive dataset (over 1 million km), this approach leads to three levels of in-depth analysis: trip, scenario, and driving. The trip-level analysis explains representative properties observed in real-world trips, while the scenario-level analysis focuses on scenario conditions resulting from road events that reduce vehicle speed. The driving-level analysis identifies the cause of driving regimes for specific situations and characterizes typical human driving behaviors. Such analyses can support the design of both trip- and scenario-based tests, the modeling of human drivers, and the establishment of guidelines for connected and automated vehicles.
Submitted 15 January, 2025;
originally announced January 2025.
-
Plug-and-Play Tri-Branch Invertible Block for Image Rescaling
Authors:
Jingwei Bao,
Jinhua Hao,
Pengcheng Xu,
Ming Sun,
Chao Zhou,
Shuyuan Zhu
Abstract:
High-resolution (HR) images are commonly downscaled to low-resolution (LR) to reduce bandwidth, followed by upscaling to restore their original details. Recent advancements in image rescaling algorithms have employed invertible neural networks (INNs) to create a unified framework for downscaling and upscaling, ensuring a one-to-one mapping between LR and HR images. Traditional methods, utilizing dual-branch based vanilla invertible blocks, process high-frequency and low-frequency information separately, often relying on specific distributions to model high-frequency components. However, processing the low-frequency component directly in the RGB domain introduces channel redundancy, limiting the efficiency of image reconstruction. To address these challenges, we propose a plug-and-play tri-branch invertible block (T-InvBlocks) that decomposes the low-frequency branch into luminance (Y) and chrominance (CbCr) components, reducing redundancy and enhancing feature processing. Additionally, we adopt an all-zero mapping strategy for high-frequency components during upscaling, focusing essential rescaling information within the LR image. Our T-InvBlocks can be seamlessly integrated into existing rescaling models, improving performance in both general rescaling tasks and scenarios involving lossy compression. Extensive experiments confirm that our method advances the state of the art in HR image reconstruction.
Submitted 18 December, 2024;
originally announced December 2024.
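The luminance/chrominance decomposition mentioned in the abstract above can be illustrated with the standard full-range BT.601 transform used by JPEG/JFIF. This sketch only shows the colour-space step on a single pixel; whether T-InvBlocks uses exactly these coefficients is an assumption, and the invertible network itself is not represented here.

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Full-range BT.601 (JFIF) RGB -> YCbCr for 8-bit-scaled values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y: float, cb: float, cr: float) -> tuple[float, float, float]:
    """Inverse transform; the mapping is linear and therefore invertible."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b


if __name__ == "__main__":
    y, cb, cr = rgb_to_ycbcr(200.0, 120.0, 40.0)
    print(y, cb, cr)
    print(ycbcr_to_rgb(y, cb, cr))
```

For a neutral grey pixel both chrominance values sit at the midpoint of 128, which shows how the three RGB channels carry largely overlapping brightness information that the Y channel captures on its own.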
-
Multi-resolution Guided 3D GANs for Medical Image Translation
Authors:
Juhyung Ha,
Jong Sung Park,
David Crandall,
Eleftherios Garyfallidis,
Xuhong Zhang
Abstract:
Medical image translation is the process of converting from one imaging modality to another, in order to reduce the need for multiple image acquisitions from the same patient. This can enhance the efficiency of treatment by reducing the time, equipment, and labor needed. In this paper, we introduce a multi-resolution guided Generative Adversarial Network (GAN)-based framework for 3D medical image translation. Our framework uses a 3D multi-resolution Dense-Attention UNet (3D-mDAUNet) as the generator and a 3D multi-resolution UNet as the discriminator, optimized with a unique combination of loss functions including voxel-wise GAN loss and 2.5D perception loss. Our approach yields promising results in volumetric image quality assessment (IQA) across a variety of imaging modalities, body regions, and age groups, demonstrating its robustness. Furthermore, we propose a synthetic-to-real applicability assessment as an additional evaluation to assess the effectiveness of synthetic data in downstream applications such as segmentation. This comprehensive evaluation shows that our method produces synthetic medical images that are not only of high quality but also potentially useful in clinical applications. Our code is available at github.com/juhha/3D-mADUNet.
Submitted 30 November, 2024;
originally announced December 2024.
-
Mitigating Unauthorized Speech Synthesis for Voice Protection
Authors:
Zhisheng Zhang,
Qianyi Yang,
Derui Wang,
Pengyang Huang,
Yuxin Cao,
Kai Ye,
Jie Hao
Abstract:
In recent years, it has become possible to perfectly replicate a speaker's voice from just a few speech samples, while malicious voice exploitation (e.g., telecom fraud for illegal financial gain) has brought huge hazards to our daily lives. Therefore, it is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. Most previous defense methods have focused on spoofing speaker verification systems in timbre similarity, but the synthesized deepfake speech is still of high quality. In response to the rising hazards, we devise an effective, transferable, and robust proactive protection technology named Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises to original speech samples to prevent them from being effectively learned by text-to-speech (TTS) synthesis models, so that high-quality deepfake speech cannot be generated. We conduct extensive experiments on state-of-the-art (SOTA) TTS models utilizing objective and subjective metrics to comprehensively evaluate our proposed method. The experimental results demonstrate outstanding effectiveness and transferability across various models. Compared to the speech unclarity score of 21.94% from voice synthesizers trained on samples without protection, POP-protected samples significantly increase it to 127.31%. Moreover, our method shows robustness against noise reduction and data augmentation techniques, thereby greatly reducing potential hazards.
Submitted 28 October, 2024;
originally announced October 2024.
-
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
Authors:
Taiyi Wang,
Zhihao Wu,
Jianheng Liu,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
On-device control agents, especially on mobile devices, are responsible for operating mobile devices to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and enables training data collection 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.
Submitted 21 February, 2025; v1 submitted 18 October, 2024;
originally announced October 2024.
-
OAPT: Offset-Aware Partition Transformer for Double JPEG Artifacts Removal
Authors:
Qiao Mo,
Yukang Ding,
Jinhua Hao,
Qiang Zhu,
Ming Sun,
Chao Zhou,
Feiyu Chen,
Shuyuan Zhu
Abstract:
Deep learning-based methods have shown remarkable performance in single JPEG artifacts removal task. However, existing methods tend to degrade on double JPEG images, which are prevalent in real-world scenarios. To address this issue, we propose Offset-Aware Partition Transformer for double JPEG artifacts removal, termed as OAPT. We conduct an analysis of double JPEG compression that results in up…
▽ More
Deep learning-based methods have shown remarkable performance in single JPEG artifacts removal task. However, existing methods tend to degrade on double JPEG images, which are prevalent in real-world scenarios. To address this issue, we propose Offset-Aware Partition Transformer for double JPEG artifacts removal, termed as OAPT. We conduct an analysis of double JPEG compression that results in up to four patterns within each 8x8 block and design our model to cluster the similar patterns to remedy the difficulty of restoration. Our OAPT consists of two components: compression offset predictor and image reconstructor. Specifically, the predictor estimates pixel offsets between the first and second compression, which are then utilized to divide different patterns. The reconstructor is mainly based on several Hybrid Partition Attention Blocks (HPAB), combining vanilla window-based self-attention and sparse attention for clustered pattern features. Extensive experiments demonstrate that OAPT outperforms the state-of-the-art method by more than 0.16dB in double JPEG image restoration task. Moreover, without increasing any computation cost, the pattern clustering module in HPAB can serve as a plugin to enhance other transformer-based image restoration methods. The code will be available at https://github.com/QMoQ/OAPT.git .
Submitted 24 September, 2024; v1 submitted 21 August, 2024;
originally announced August 2024.
-
Beyond the Eye: A Relational Model for Early Dementia Detection Using Retinal OCTA Images
Authors:
Shouyue Liu,
Ziyi Zhang,
Yuanyuan Gu,
Jinkui Hao,
Yonghuai Liu,
Huazhu Fu,
Xinyu Guo,
Hong Song,
Shuting Zhang,
Yitian Zhao
Abstract:
Early detection of dementia, such as Alzheimer's disease (AD) or mild cognitive impairment (MCI), is essential to enable timely intervention and potential treatment. Accurate detection of AD/MCI is challenging due to the high complexity, cost, and often invasive nature of current diagnostic techniques, which limit their suitability for large-scale population screening. Given the shared embryological origins and physiological characteristics of the retina and brain, retinal imaging is emerging as a potentially rapid and cost-effective alternative for the identification of individuals with or at high risk of AD. In this paper, we present a novel PolarNet+ that uses retinal optical coherence tomography angiography (OCTA) to discriminate early-onset AD (EOAD) and MCI subjects from controls. Our method first maps OCTA images from Cartesian coordinates to polar coordinates, allowing approximate sub-region calculation to implement the clinician-friendly early treatment of diabetic retinopathy study (ETDRS) grid analysis. We then introduce a multi-view module to serialize and analyze the images along three dimensions for comprehensive, clinically useful information extraction. Finally, we abstract the sequence embedding into a graph, transforming the detection task into a general graph classification problem. A regional relationship module is applied after the multi-view module to excavate the relationship between the sub-regions. Such regional relationship analyses validate known eye-brain links and reveal new discriminative patterns.
Submitted 12 March, 2025; v1 submitted 9 August, 2024;
originally announced August 2024.
-
Multiscale Spatio-Temporal Enhanced Short-term Load Forecasting of Electric Vehicle Charging Stations
Authors:
Zongbao Zhang,
Jiao Hao,
Wenmeng Zhao,
Yan Liu,
Yaohui Huang,
Xinhang Luo
Abstract:
The rapid expansion of electric vehicles (EVs) has rendered the load forecasting of electric vehicle charging stations (EVCS) increasingly critical. The primary challenge in achieving precise load forecasting for EVCS lies in accounting for the nonlinearity of charging behaviors, the spatial interactions among different stations, and the intricate temporal variations in usage patterns. To address these challenges, we propose a Multiscale Spatio-Temporal Enhanced Model (MSTEM) for effective load forecasting at EVCS. MSTEM incorporates a multiscale graph neural network to discern hierarchical nonlinear temporal dependencies across various time scales. It also integrates a recurrent learning component and a residual fusion mechanism, enhancing its capability to accurately capture spatial and temporal variations in charging patterns. The effectiveness of the proposed MSTEM has been validated through comparative analysis with six baseline models using three evaluation metrics. The case studies utilize real-world datasets for both fast and slow charging loads at EVCS in Perth, UK. The experimental results demonstrate the superiority of MSTEM in short-term continuous load forecasting for EVCS.
Submitted 29 May, 2024;
originally announced May 2024.
-
Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer
Authors:
Weifei Jin,
Yuxin Cao,
Junjie Su,
Qi Shen,
Kai Ye,
Derui Wang,
Jie Hao,
Ziyao Liu
Abstract:
In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security concerns have received much more attention than ever before, primarily due to the susceptibility of Deep Neural Networks. Previous studies have illustrated that surreptitiously crafting adversarial perturbations enables the manipulation of speech recognition systems, resulting in the production of malicious commands. These attack methods mostly require adding noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modifications. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples based on Text-to-Speech (TTS) synthesis audio. However, style modifications based on optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of the Style Transfer Attack (STA), which combines style transfer and adversarial attack in sequential order. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve a success rate of 82% in attacks, while preserving sound naturalness according to our user study.
Submitted 15 May, 2024;
originally announced May 2024.
-
Resilient by Design: Simulating Street Network Disruptions across Every Urban Area in the World
Authors:
Geoff Boeing,
Jaehyun Ha
Abstract:
Street networks allow people and goods to move through cities, but they are vulnerable to disasters like floods, earthquakes, and terrorist attacks. Well-planned network design can make a city more resilient and robust to such disruptions, but we still know little about worldwide patterns of vulnerability, or worldwide empirical relationships between specific design characteristics and resilience. This study quantifies and measures the vulnerability of the street networks of every urban area in the world then models the relationships between vulnerability and street network design characteristics. To do so, we simulate over 2.4 billion trips across more than 8,000 urban areas in 178 countries, while also simulating network disruption events representing floods, earthquakes, and targeted attacks. We find that disrupting high-centrality nodes severely impacts network function. All else equal, networks with higher connectivity, fewer chokepoints, or less circuity are less vulnerable to disruption's impacts. This study thus contributes a new global understanding of network design and vulnerability to the literature. We argue that these design characteristics offer high leverage points for street network resilience and robustness that planners should emphasize when designing or retrofitting urban networks.
Submitted 15 March, 2024;
originally announced March 2024.
-
CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement
Authors:
Qiang Zhu,
Jinhua Hao,
Yukang Ding,
Yu Liu,
Qiao Mo,
Ming Sun,
Chao Zhou,
Shuyuan Zhu
Abstract:
Recently, numerous approaches have achieved notable success in compressed video quality enhancement (VQE). However, these methods usually ignore the valuable coding priors inherently embedded in compressed videos, such as motion vectors and residual frames, which carry abundant temporal and spatial information. To remedy this problem, we propose the Coding Priors-Guided Aggregation (CPGA) network to utilize temporal and spatial information from coding priors. The CPGA mainly consists of an inter-frame temporal aggregation (ITA) module and a multi-scale non-local aggregation (MNA) module. Specifically, the ITA module aggregates temporal information from consecutive frames and coding priors, while the MNA module globally captures spatial information guided by residual frames. In addition, to facilitate research on the VQE task, we construct the new Video Coding Priors (VCP) dataset, comprising 300 videos with various coding priors extracted from the corresponding bitstreams. It remedies the lack of coding information in previous datasets. Experimental results demonstrate the superiority of our method compared to existing state-of-the-art methods. The code and dataset will be released at https://github.com/VQE-CPGA/CPGA.git .
Submitted 19 November, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Authors:
Heeseung Kim,
Soonshin Seo,
Kyeongseok Jeong,
Ohsung Kwon,
Soyoon Kim,
Jungwhan Kim,
Jaehong Lee,
Eunwoo Song,
Myungwoo Oh,
Jung-Woo Ha,
Sungroh Yoon,
Kang Min Yoo
Abstract:
Recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We have verified the inclusion of prosody in speech tokens that predominantly contain semantic information and have used this foundation to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics. To construct USDM, we fine-tune our speech-text model on spoken dialog data using a multi-step spoken dialog template that stimulates the chain-of-reasoning capabilities exhibited by the underlying LLM. Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines. Our code and checkpoints are available at https://github.com/naver-ai/usdm.
Submitted 27 November, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
3D Volumetric Super-Resolution in Radiology Using 3D RRDB-GAN
Authors:
Juhyung Ha,
Nian Wang,
Surendra Maharjan,
Xuhong Zhang
Abstract:
This study introduces the 3D Residual-in-Residual Dense Block GAN (3D RRDB-GAN) for 3D super-resolution of radiology imagery. A key aspect of 3D RRDB-GAN is the integration of a 2.5D perceptual loss function, which contributes to improved volumetric image quality and realism. The effectiveness of our model was evaluated through 4x super-resolution experiments across diverse datasets, including Mice Brain MRH, OASIS, HCP1200, and MSD-Task-6. These evaluations, encompassing both quantitative metrics like LPIPS and FID and qualitative assessments through sample visualizations, demonstrate the model's effectiveness in detailed image analysis. The 3D RRDB-GAN offers a significant contribution to medical imaging, particularly by enriching the depth, clarity, and volumetric detail of medical images. Its application shows promise in enhancing the interpretation and analysis of complex medical imagery from a comprehensive 3D perspective.
Submitted 6 February, 2024;
originally announced February 2024.
-
DMCE: Diffusion Model Channel Enhancer for Multi-User Semantic Communication Systems
Authors:
Youcheng Zeng,
Xinxin He,
Xu Chen,
Haonan Tong,
Zhaohui Yang,
Yijun Guo,
Jianjun Hao
Abstract:
To achieve continuous massive data transmission with significantly reduced data payload, the users can adopt semantic communication techniques to compress the redundant information by transmitting semantic features instead. However, current works on semantic communication mainly focus on high compression ratio, neglecting the wireless channel effects including dynamic distortion and multi-user interference, which significantly limit the fidelity of semantic communication. To address this, this paper proposes a diffusion model (DM)-based channel enhancer (DMCE) for improving the performance of multi-user semantic communication, with the DM learning the particular data distribution of channel effects on the transmitted semantic features. In the considered system model, multiple users (such as road cameras) transmit semantic features of multi-source data to a receiver by applying the joint source-channel coding (JSCC) techniques, and the receiver fuses the semantic features from multiple users to complete specific tasks. Then, we propose DMCE to enhance the channel state information (CSI) estimation for improving the restoration of the received semantic features. Finally, the fusion results at the receiver are significantly enhanced, demonstrating a robust performance even under low signal-to-noise ratio (SNR) regimes, enabling the generation of effective object segmentation images. Extensive simulation results with a traffic scenario dataset show that the proposed scheme can improve the mean Intersection over Union (mIoU) by more than 25% at low SNR regimes, compared with the benchmark schemes.
Submitted 29 January, 2024;
originally announced January 2024.
-
Polar-Net: A Clinical-Friendly Model for Alzheimer's Disease Detection in OCTA Images
Authors:
Shouyue Liu,
Jinkui Hao,
Yanwu Xu,
Huazhu Fu,
Xinyu Guo,
Jiang Liu,
Yalin Zheng,
Yonghuai Liu,
Jiong Zhang,
Yitian Zhao
Abstract:
Optical Coherence Tomography Angiography (OCTA) is a promising tool for detecting Alzheimer's disease (AD) by imaging the retinal microvasculature. Ophthalmologists commonly use region-based analysis, such as the ETDRS grid, to study OCTA image biomarkers and understand the correlation with AD. However, existing studies have used general deep computer vision methods, which present challenges in providing interpretable results and leveraging clinical prior knowledge. To address these challenges, we propose a novel deep-learning framework called Polar-Net. Our approach involves mapping OCTA images from Cartesian coordinates to polar coordinates, which allows for the use of approximate sector convolution and enables the implementation of the ETDRS grid-based regional analysis method commonly used in clinical practice. Furthermore, Polar-Net incorporates clinical prior information of each sector region into the training process, which further enhances its performance. Additionally, our framework adapts to acquire the importance of the corresponding retinal region, which helps researchers and clinicians understand the model's decision-making process in detecting AD and assess its conformity to clinical observations. Through evaluations on private and public datasets, we have demonstrated that Polar-Net outperforms existing state-of-the-art methods and provides more valuable pathological evidence for the association between retinal vascular changes and AD. In addition, we also show that the two innovative modules introduced in our framework have a significant impact on improving overall performance.
Submitted 10 November, 2023;
originally announced November 2023.
-
PIPO-Net: A Penalty-based Independent Parameters Optimization Deep Unfolding Network
Authors:
Xiumei Li,
Zhijie Zhang,
Huang Bai,
Ljubiša Stanković,
Junpeng Hao,
Junmei Sun
Abstract:
Compressive sensing (CS) has been widely applied in signal and image processing fields. Traditional CS reconstruction algorithms have a complete theoretical foundation but suffer from high computational complexity, while fashionable deep network-based methods can achieve high-accuracy CS reconstruction but lack interpretability. These facts motivate us to develop a deep unfolding network named the penalty-based independent parameters optimization network (PIPO-Net) to combine the merits of the two above-mentioned kinds of CS methods. Each module of PIPO-Net can be viewed separately as an optimization problem with its respective penalty function. The main characteristic of PIPO-Net is that, in each round of training, the learnable parameters in one module are updated independently from those of other modules. This makes the network more flexible in finding the optimal solutions of the corresponding problems. Moreover, the mean-subtraction sampling and the high-frequency complementary blocks are developed to improve the performance of PIPO-Net. Experiments on reconstructing CS images demonstrate the effectiveness of the proposed PIPO-Net.
Submitted 4 November, 2023;
originally announced November 2023.
-
Multi-granularity Backprojection Transformer for Remote Sensing Image Super-Resolution
Authors:
Jinglei Hao,
Wukai Li,
Binglu Wang,
Shunzhou Wang,
Yuting Lu,
Ning Li,
Yongqiang Zhao
Abstract:
Backprojection networks have achieved promising super-resolution performance for natural images but have not been well explored in the remote sensing image super-resolution (RSISR) field due to their high computation costs. In this paper, we propose a Multi-granularity Backprojection Transformer termed MBT for RSISR. MBT incorporates the backprojection learning strategy into a Transformer framework. It consists of Scale-aware Backprojection-based Transformer Layers (SPTLs) for scale-aware low-resolution feature learning and Context-aware Backprojection-based Transformer Blocks (CPTBs) for hierarchical feature learning. A backprojection-based reconstruction module (PRM) is also introduced to enhance the hierarchical features for image reconstruction. MBT stands out by efficiently learning low-resolution features without excessive modules for high-resolution processing, and therefore requires fewer computational resources. Experimental results on the UCMerced and AID datasets demonstrate that MBT obtains state-of-the-art results compared to other leading methods.
Submitted 19 October, 2023;
originally announced October 2023.
-
Dataset Generation for Drone Optimal Placement Using Machine Learning
Authors:
Jialin Hao
Abstract:
The unmanned aerial vehicle (UAV), or drone, is increasingly becoming a promising tool in communication systems. This report explains the generation details of a dataset which will be used to design an algorithm for the optimal placement of UAVs in the drone-assisted vehicular network (DAVN). The goal is to improve the drones' communication and energy efficiency, following our previous work. The report is organized as follows: the first section is devoted to the delay analysis of the vehicle requests in the DAVN using queuing theory; the second section models the energy consumption of the drones; and the third section explains the simulation scenario and dataset features. The notations and terminologies used in this report are summarized in the last section.
Submitted 4 September, 2023;
originally announced October 2023.
-
Domain knowledge-informed Synthetic fault sample generation with Health Data Map for cross-domain Planetary Gearbox Fault Diagnosis
Authors:
Jong Moon Ha,
Olga Fink
Abstract:
Extensive research has been conducted on fault diagnosis of planetary gearboxes using vibration signals and deep learning (DL) approaches. However, DL-based methods are susceptible to the domain shift problem caused by varying operating conditions of the gearbox. Although domain adaptation and data synthesis methods have been proposed to overcome such domain shifts, they are often not directly applicable in real-world situations where only healthy data is available in the target domain. To tackle the challenge of extreme domain shift scenarios where only healthy data is available in the target domain, this paper proposes two novel domain knowledge-informed data synthesis methods utilizing the health data map (HDMap). The two proposed approaches are referred to as scaled CutPaste and FaultPaste. The HDMap is used to physically represent the vibration signal of the planetary gearbox as an image-like matrix, allowing for visualization of fault-related features. CutPaste and FaultPaste are then applied to generate faulty samples based on the healthy data in the target domain, using domain knowledge and fault signatures extracted from the source domain, respectively. In addition to generating realistic faults, the proposed methods introduce scaling of fault signatures for controlled synthesis of faults with various severity levels. A case study is conducted on a planetary gearbox testbed to evaluate the proposed approaches. The results show that the proposed methods are capable of accurately diagnosing faults, even in cases of extreme domain shift, and can estimate the severity of faults that have not been previously observed in the target domain.
Submitted 26 November, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
AsConvSR: Fast and Lightweight Super-Resolution Network with Assembled Convolutions
Authors:
Jiaming Guo,
Xueyi Zou,
Yuyi Chen,
Yi Liu,
Jia Hao,
Jianzhuang Liu,
Youliang Yan
Abstract:
In recent years, videos and images in 720p (HD), 1080p (FHD) and 4K (UHD) resolution have become more popular for display devices such as TVs, mobile phones and VR. However, these high-resolution images cannot achieve the expected visual effect due to the limitation of internet bandwidth, and they pose a great challenge for super-resolution networks to achieve real-time performance. To address this challenge, we explore multiple efficient network designs, such as pixel-unshuffle, repeat upscaling, and local skip connection removal, and propose a fast and lightweight super-resolution network. Furthermore, by analyzing the applications of the idea of divide-and-conquer in super-resolution, we propose assembled convolutions which can adapt convolution kernels according to the input features. Experiments suggest that our method outperforms all the state-of-the-art efficient super-resolution models, and achieves optimal results in terms of runtime and quality. In addition, our method won first place in NTIRE 2023 Real-Time Super-Resolution - Track 1 ($\times$2). The code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/AsConvSR
Submitted 5 May, 2023;
originally announced May 2023.
-
Fixed-time safe tracking control of uncertain high-order nonlinear pure-feedback systems via unified transformation functions
Authors:
Chaoqun Guo,
Jiangping Hu,
Jiasheng Hao,
Sergej Celikovsky,
Xiaoming Hu
Abstract:
In this paper, a fixed-time safe control problem is investigated for an uncertain high-order nonlinear pure-feedback system with state constraints. A new nonlinear transformation function is first proposed to handle both the constrained and unconstrained cases in a unified way. Further, a radial basis function neural network is constructed to approximate the unknown dynamics in the system and a fixed-time dynamic surface control (FDSC) technique is developed to facilitate the fixed-time control design for the uncertain high-order pure-feedback system. By combining the proposed unified transformation function with the FDSC technique, an adaptive fixed-time control strategy is proposed to guarantee fixed-time tracking. The proposed fixed-time control strategy guarantees a uniform control structure when addressing both constrained and unconstrained situations. Numerical examples are presented to demonstrate the proposed fixed-time tracking control strategy.
Submitted 30 April, 2023;
originally announced May 2023.
-
Multi-User Cooperation for Covert Communication Under Quasi-Static Fading
Authors:
Jinyoung Lee,
Duc Trung Dinh,
Hyeonsik Yeom,
Si-Hyeon Lee,
Jeongseok Ha
Abstract:
This work studies a covert communication scheme for an uplink multi-user scenario in which some users are opportunistically selected to help a covert user. In particular, the selected users emit interfering signals via an orthogonal resource dedicated to the covert user together with signals for their own communications using orthogonal resources allocated to the selected users, which helps the covert user hide the presence of the covert communication. For the covert communication scheme, we carry out extensive analysis and find system parameters in closed forms. The analytic derivation for the system parameters allows one to find the optimal combination of system parameters by performing a simple one-dimensional search. In addition, the analytic results elucidate relations among the system parameters. In particular, it will be proved that the optimal strategy for the non-covert users is an on-off scheme with equal transmit power. The theoretical results derived in this work are confirmed by comparing them with numerical results obtained with exhaustive searches. Finally, we demonstrate that the results of this work can be utilized in versatile ways by presenting a design of covert communication that takes energy efficiency into account.
Submitted 10 April, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Secure Power Control for Downlink Cell-Free Massive MIMO With Passive Eavesdroppers
Authors:
Junguk Park,
Sangseok Yun,
Jeongseok Ha
Abstract:
This work studies secure communications for a cell-free massive multiple-input multiple-output (CF-mMIMO) network which is attacked by multiple passive eavesdroppers overhearing communications between access points (APs) and users in the network. It will be revealed that the distributed APs in CF-mMIMO allow not only legitimate users but also eavesdroppers to reap the diversity gain, which seriously degrades secrecy performance. Motivated by this, this work proposes an artificial noise (AN)-aided secure power control scheme for CF-mMIMO under passive eavesdropping, aiming to achieve a higher secrecy rate and/or guarantee security. In particular, it will be demonstrated that a careful use of the AN signal in the power control is especially important to improve the secrecy performance. The performance of the proposed power control scheme is evaluated and compared with various power control schemes via numerical experiments, which clearly show that the proposed power control scheme outperforms all the competing schemes.
Submitted 25 November, 2022;
originally announced November 2022.
-
Joint Design of Power Control and Access Point Scheduling for Uplink Cell-Free Massive MIMO Networks
Authors:
Hyeonsik Yeom,
Junguk Park,
Jinho Choi,
Jeongseok Ha
Abstract:
This work proposes a joint power control and access point (AP) scheduling algorithm for uplink cell-free massive multiple-input multiple-output (CF-mMIMO) networks without the channel hardening assumption. Extensive studies have been done on the joint optimization problem assuming channel hardening. However, it has been reported that channel hardening may not hold in some CF-mMIMO environments. In particular, the existing Use-and-then-Forget (UatF) bound, which is based on channel hardening, often seriously underestimates user rates in CF-mMIMO. Therefore, a new performance evaluation technique that does not resort to channel hardening is indispensable for accurate performance estimation. Motivated by this, we propose a new bound on the achievable rate of uplink CF-mMIMO. It is demonstrated that the proposed bound provides a more accurate performance estimate of CF-mMIMO than the existing UatF bound. The proposed bound also enables us to develop a joint power control and AP scheduling algorithm that targets both improving fairness and reducing the resources used between the APs and a central processing unit (CPU). We conduct extensive performance evaluations and comparisons for systems designed with the proposed and existing algorithms. The comparisons show that a considerable performance improvement is achievable with the proposed algorithm even with reduced resources between the APs and the CPU.
Submitted 25 November, 2022; v1 submitted 22 November, 2022;
originally announced November 2022.
-
Retinal Structure Detection in OCTA Image via Voting-based Multi-task Learning
Authors:
Jinkui Hao,
Ting Shen,
Xueli Zhu,
Yonghuai Liu,
Ardhendu Behera,
Dan Zhang,
Bang Chen,
Jiang Liu,
Jiong Zhang,
Yitian Zhao
Abstract:
Automated detection of retinal structures, such as retinal vessels (RV), the foveal avascular zone (FAZ), and retinal vascular junctions (RVJ), is of great importance for understanding diseases of the eye and for clinical decision-making. In this paper, we propose a novel Voting-based Adaptive Feature Fusion multi-task network (VAFF-Net) for joint segmentation, detection, and classification of RV, FAZ, and RVJ in optical coherence tomography angiography (OCTA). A task-specific voting gate module is proposed to adaptively extract and fuse different features for specific tasks at two levels: features at different spatial positions from a single encoder, and features from multiple encoders. In particular, since the complexity of the microvasculature in OCTA images makes simultaneous precise localization and classification of retinal vascular junctions into bifurcations and crossings a challenging task, we specifically design a task head that combines heatmap regression and grid classification. We take advantage of three different en face angiograms from various retinal layers, rather than following existing methods that use only a single en face angiogram. To facilitate further research, part of these datasets, together with the source code and evaluation benchmark, have been released for public access: https://github.com/iMED-Lab/VAFF-Net.
Submitted 23 August, 2022;
originally announced August 2022.
-
Perception-Distortion Balanced ADMM Optimization for Single-Image Super-Resolution
Authors:
Yuehan Zhang,
Bo Ji,
Jia Hao,
Angela Yao
Abstract:
In image super-resolution, both pixel-wise accuracy and perceptual fidelity are desirable. However, most deep learning methods achieve high performance in only one aspect due to the perception-distortion trade-off, and works that successfully balance the trade-off rely on fusing results from separately trained models with ad-hoc post-processing. In this paper, we propose a novel super-resolution model with a low-frequency constraint (LFc-SR), which balances objective and perceptual quality through a single model and yields super-resolved images with high PSNR and perceptual scores. We further introduce an ADMM-based alternating optimization method for the non-trivial learning of the constrained model. Experiments show that our method, without cumbersome post-processing procedures, achieves state-of-the-art performance. The code is available at https://github.com/Yuehan717/PDASR.
Submitted 16 August, 2022; v1 submitted 5 August, 2022;
originally announced August 2022.
-
Improving Mandarin Speech Recognition with Block-augmented Transformer
Authors:
Xiaoming Ren,
Huifeng Zhu,
Liuwei Wei,
Minghui Wu,
Jie Hao
Abstract:
Recently, the Convolution-augmented Transformer (Conformer) has shown promising results in Automatic Speech Recognition (ASR), outperforming the previous best published Transformer Transducer. In this work, we hypothesize that the output information of each block in the encoder and decoder is not completely inclusive; in other words, the outputs of different blocks may be complementary. We study how to take advantage of the complementary information of each block in a parameter-efficient way, with the expectation that this may lead to more robust performance. We therefore propose the Block-augmented Transformer for speech recognition, named Blockformer. We implement two block ensemble methods: the base Weighted Sum of the Blocks Output (Base-WSBO), and the Squeeze-and-Excitation module to Weighted Sum of the Blocks Output (SE-WSBO). Experiments show that the Blockformer significantly outperforms the state-of-the-art Conformer-based models on AISHELL-1. Our model achieves a CER of 4.29% without using a language model and 4.05% with an external language model on the test set.
Submitted 1 December, 2022; v1 submitted 24 July, 2022;
originally announced July 2022.
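As a rough illustration of the weighted-sum idea behind Base-WSBO in the Blockformer entry above, the following minimal Python sketch fuses the outputs of several blocks using one scalar weight per block. The softmax normalization, the function name `weighted_sum_of_blocks`, and the toy inputs are assumptions made for illustration only; the abstract does not state how the weights are parameterized or normalized.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum_of_blocks(block_outputs: List[List[float]],
                           block_logits: List[float]) -> List[float]:
    """Fuse per-block feature vectors into a single vector.

    block_outputs: one feature vector per block, all of the same length.
    block_logits: one scalar per block (hypothetical learnable weights).
    """
    if len(block_outputs) != len(block_logits):
        raise ValueError("need exactly one weight per block")
    # Assumption: weights are normalized so that they sum to 1.
    weights = softmax(block_logits)
    dim = len(block_outputs[0])
    fused = [0.0] * dim
    for weight, output in zip(weights, block_outputs):
        for i in range(dim):
            fused[i] += weight * output[i]
    return fused


if __name__ == "__main__":
    # Two toy "blocks" with equal logits contribute equally to the result.
    print(weighted_sum_of_blocks([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
```

In a trained model the per-block scalars would be learned jointly with the rest of the network; here they are fixed inputs so that the fusion step can be read in isolation.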
-
GLD-Net: Improving Monaural Speech Enhancement by Learning Global and Local Dependency Features with GLD Block
Authors:
Xinmeng Xu,
Yang Wang,
Jie Jia,
Binbin Chen,
Jianjun Hao
Abstract:
For monaural speech enhancement, contextual information is important for accurate speech estimation. However, commonly used convolutional neural networks (CNNs) are weak at capturing temporal contexts, since they only build blocks that process one local neighborhood at a time. To address this problem, we draw on human auditory perception to introduce a two-stage trainable reasoning mechanism, referred to as the global-local dependency (GLD) block. GLD blocks capture long-term dependencies of time-frequency bins at both the global level and the local level from the noisy spectrogram to help detect correlations among the speech part, the noise part, and the whole noisy input. Moreover, we construct a monaural speech enhancement network called GLD-Net, which adopts an encoder-decoder architecture and consists of a speech object branch, an interference branch, and a global noisy branch. The extracted speech features at the global and local levels are efficiently reasoned over and aggregated in each of the branches. We compare the proposed GLD-Net with existing state-of-the-art methods on the WSJ0 and DEMAND datasets. The results show that GLD-Net outperforms the state-of-the-art methods in terms of PESQ and STOI.
Submitted 29 June, 2022;
originally announced June 2022.
-
U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention
Authors:
Xinmeng Xu,
Jianjun Hao
Abstract:
For supervised speech enhancement, contextual information is important for accurate spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing temporal contexts. To leverage long-term contexts for tracking a target speaker, this paper treats speech enhancement as sequence-to-sequence mapping and proposes a novel monaural speech enhancement U-net structure based on the Transformer, dubbed U-Former. The key idea is to model long-term correlations and dependencies, which are crucial for accurate noisy speech modeling, through multi-head attention mechanisms. For this purpose, U-Former incorporates multi-head attention mechanisms at two levels: 1) a multi-head self-attention module that calculates the attention map along both the time and frequency axes to generate time and frequency sub-attention maps for leveraging global interactions between encoder features, and 2) multi-head cross-attention modules, inserted in the skip connections, that allow a fine recovery in the decoder by filtering out uncorrelated features. Experimental results show that the U-Former obtains consistently better performance than recent models in terms of PESQ, STOI, and SSNR scores.
Submitted 12 October, 2022; v1 submitted 17 May, 2022;
originally announced May 2022.
-
AI-enabled Automatic Multimodal Fusion of Cone-Beam CT and Intraoral Scans for Intelligent 3D Tooth-Bone Reconstruction and Clinical Applications
Authors:
Jin Hao,
Jiaxiang Liu,
Jin Li,
Wei Pan,
Ruizhe Chen,
Huimin Xiong,
Kaiwei Sun,
Hangzheng Lin,
Wanlu Liu,
Wanghui Ding,
Jianfei Yang,
Haoji Hu,
Yueling Zhang,
Yang Feng,
Zeyu Zhao,
Huikai Wu,
Youyi Zheng,
Bing Fang,
Zuozhu Liu,
Zhihe Zhao
Abstract:
A critical step in virtual dental treatment planning is to accurately delineate all tooth-bone structures from CBCT with high fidelity and accurate anatomical information. Previous studies have established several methods for CBCT segmentation using deep learning. However, the inherent resolution discrepancy of CBCT and the loss of occlusal and dentition information largely limit its clinical applicability. Here, we present a Deep Dental Multimodal Analysis (DDMA) framework consisting of a CBCT segmentation model, an intraoral scan (IOS) segmentation model (the most accurate digital dental model), and a fusion model to generate 3D fused crown-root-bone structures with high fidelity and accurate occlusal and dentition information. Our model was trained with a large-scale dataset of 503 CBCT and 28,559 IOS meshes manually annotated by experienced human experts. For CBCT segmentation, we use a five-fold cross-validation test, each with 50 CBCT, and our model achieves an average Dice coefficient and IoU of 93.99% and 88.68%, respectively, significantly outperforming the baselines. For IOS segmentation, our model achieves an mIoU of 93.07% and 95.70% on the maxilla and mandible, respectively, on a test set of 200 IOS meshes, which are 1.77% and 3.52% higher than the state-of-the-art method. Our DDMA framework takes about 20 to 25 minutes to generate the fused 3D mesh model following the sequential processing order, compared to over 5 hours for human experts. Notably, our framework has been incorporated into software by a clear aligner manufacturer, and real-world clinical cases demonstrate that our model can visualize crown-root-bone structures during the entire orthodontic treatment and can predict risks such as dehiscence and fenestration. These findings demonstrate the potential of multi-modal deep learning to improve the quality of digital dental models and help dentists make better clinical decisions.
Submitted 11 March, 2022;
originally announced March 2022.
-
Feasibility Study of Multi-Site Split Learning for Privacy-Preserving Medical Systems under Data Imbalance Constraints in COVID-19, X-Ray, and Cholesterol Dataset
Authors:
Yoo Jeong Ha,
Gusang Lee,
Minjae Yoo,
Soyi Jung,
Seehwan Yoo,
Joongheon Kim
Abstract:
It seems as though progressively more people are in the race to upload content, data, and information online, and hospitals have not neglected this trend either. Hospitals are now at the forefront of multi-site medical data sharing, providing groundbreaking advancements in the way health records are shared and patients are diagnosed. Sharing of medical data is essential in modern medical research. Yet, as with all data sharing technology, the challenge is to balance improved treatment with protecting patients' personal information. This paper presents a novel split learning algorithm, termed "multi-site split learning", which enables a secure transfer of medical data between multiple hospitals without fear of exposing personal data contained in patient records. It also explores the effects of varying the number of end-systems and the ratio of data imbalance on deep learning performance. A guideline for the optimal configuration of split learning that ensures the privacy of patient data whilst achieving performance is given empirically. We argue the benefits of our multi-site split learning algorithm, especially regarding the privacy-preserving factor, using CT scans of COVID-19 patients, X-ray bone scans, and cholesterol level medical data.
Submitted 20 February, 2022;
originally announced February 2022.
-
SEIHAI: A Sample-efficient Hierarchical AI for the MineRL Competition
Authors:
Hangyu Mao,
Chao Wang,
Xiaotian Hao,
Yihuan Mao,
Yiming Lu,
Chengjie Wu,
Jianye Hao,
Dong Li,
Pingzhong Tang
Abstract:
The MineRL competition is designed for the development of reinforcement learning and imitation learning algorithms that can efficiently leverage human demonstrations to drastically reduce the number of environment interactions needed to solve the complex ObtainDiamond task with sparse rewards. To address the challenge, in this paper, we present SEIHAI, a Sample-efficient Hierarchical AI that takes full advantage of the human demonstrations and the task structure. Specifically, we split the task into several sequentially dependent subtasks, and train a suitable agent for each subtask using reinforcement learning and imitation learning. We further design a scheduler to select different agents for different subtasks automatically. SEIHAI took first place in the preliminary and final rounds of the NeurIPS-2020 MineRL competition.
Submitted 16 November, 2021;
originally announced November 2021.
-
Spatio-Temporal Split Learning for Privacy-Preserving Medical Platforms: Case Studies with COVID-19 CT, X-Ray, and Cholesterol Data
Authors:
Yoo Jeong Ha,
Minjae Yoo,
Gusang Lee,
Soyi Jung,
Sae Won Choi,
Joongheon Kim,
Seehwan Yoo
Abstract:
Machine learning requires a large volume of sample data, especially when it is used in high-accuracy medical applications. However, patient records are among the most sensitive private information and are not usually shared among institutions. This paper presents spatio-temporal split learning, a distributed deep neural network framework, which is a turning point in allowing collaboration among privacy-sensitive organizations. Our spatio-temporal split learning shows how distributed machine learning can be efficiently conducted with minimal privacy concerns. The proposed split learning consists of a number of clients and a centralized server. Each client has only one hidden layer, which acts as the privacy-preserving layer, and the centralized server comprises the other hidden layers and the output layer. Since the centralized server does not need to access the training data and trains the deep neural network with parameters received from the privacy-preserving layer, the privacy of the original data is guaranteed. We coined the term "spatio-temporal split learning" because multiple clients are spatially distributed to cover diverse datasets from different participants, and we can temporally split the learning process, detaching the privacy-preserving layer from the rest of the learning process to minimize privacy breaches. This paper shows how we can analyze medical data whilst ensuring privacy using our proposed multi-site spatio-temporal split learning algorithm on Coronavirus Disease-19 (COVID-19) chest Computed Tomography (CT) scans, MUsculoskeletal RAdiographs (MURA) X-ray images, and cholesterol levels.
Submitted 20 August, 2021;
originally announced August 2021.
-
NTIRE 2021 Challenge on Quality Enhancement of Compressed Video: Methods and Results
Authors:
Ren Yang,
Radu Timofte,
Jing Liu,
Yi Xu,
Xinjian Zhang,
Minyi Zhao,
Shuigeng Zhou,
Kelvin C. K. Chan,
Shangchen Zhou,
Xiangyu Xu,
Chen Change Loy,
Xin Li,
Fanglong Liu,
He Zheng,
Lielin Jiang,
Qi Zhang,
Dongliang He,
Fu Li,
Qingqing Dang,
Yibin Huang,
Matteo Maggioni,
Zhongqian Fu,
Shuai Xiao,
Cheng li,
Thomas Tanay
, et al. (47 additional authors not shown)
Abstract:
This paper reviews the first NTIRE challenge on quality enhancement of compressed video, with a focus on the proposed methods and results. In this challenge, the new Large-scale Diverse Video (LDV) dataset is employed. The challenge has three tracks. Tracks 1 and 2 aim at enhancing videos compressed by HEVC at a fixed QP, while Track 3 is designed for enhancing videos compressed by x265 at a fixed bit-rate. In addition, Tracks 1 and 3 target improving fidelity (PSNR), and Track 2 targets enhancing perceptual quality. In total, the three tracks attracted 482 registrations. In the test phase, 12 teams, 8 teams, and 11 teams submitted final results for Tracks 1, 2, and 3, respectively. The proposed methods and solutions gauge the state of the art in video quality enhancement. The homepage of the challenge: https://github.com/RenYang-home/NTIRE21_VEnh
Submitted 31 August, 2022; v1 submitted 21 April, 2021;
originally announced April 2021.
-
3D Vessel Reconstruction in OCT-Angiography via Depth Map Estimation
Authors:
Shuai Yu,
Jianyang Xie,
Jinkui Hao,
Yalin Zheng,
Jiong Zhang,
Yan Hu,
Jiang Liu,
Yitian Zhao
Abstract:
Optical Coherence Tomography Angiography (OCTA) has been increasingly used in the management of eye and systemic diseases in recent years. Manual or automatic analysis of blood vessels in 2D OCTA images (en face angiograms) is commonly used in clinical practice; however, it may lose rich 3D spatial distribution information about blood vessels or capillaries that is useful for clinical decision-making. In this paper, we introduce a novel 3D vessel reconstruction framework based on the estimation of vessel depth maps from OCTA images. First, we design a network with structural constraints to predict the depth of blood vessels in OCTA images. To improve the accuracy of the predicted depth map at both the overall structure level and the pixel level, we combine MSE and SSIM losses as the training loss function. Finally, the 3D vessel reconstruction is achieved by utilizing the estimated depth map and 2D vessel segmentation results. Experimental results demonstrate that our method is effective in depth prediction and 3D vessel reconstruction for OCTA images.
Submitted 26 February, 2021;
originally announced February 2021.
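To make the combined objective in the depth-estimation entry above concrete, here is a minimal Python sketch of one way to add an MSE term and an SSIM term. It is a simplification under stated assumptions: SSIM is computed once over the whole flattened map rather than over local windows, the constants 0.01 and 0.03 are the conventional SSIM defaults, and `ssim_weight`, `global_ssim`, and `depth_loss` are hypothetical names. The abstract does not say how the two terms are weighted.

```python
from typing import List


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


def mse(pred: List[float], target: List[float]) -> float:
    """Pixel-level term: mean squared error between two flattened maps."""
    return _mean([(p - t) ** 2 for p, t in zip(pred, target)])


def global_ssim(pred: List[float], target: List[float],
                data_range: float = 1.0) -> float:
    """Structure-level term: SSIM over one global window (a simplification)."""
    c1 = (0.01 * data_range) ** 2  # conventional stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_p, mu_t = _mean(pred), _mean(target)
    var_p = _mean([(p - mu_p) ** 2 for p in pred])
    var_t = _mean([(t - mu_t) ** 2 for t in target])
    cov = _mean([(p - mu_p) * (t - mu_t) for p, t in zip(pred, target)])
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


def depth_loss(pred: List[float], target: List[float],
               ssim_weight: float = 1.0) -> float:
    """Assumed combination: MSE plus a weighted (1 - SSIM) penalty."""
    return mse(pred, target) + ssim_weight * (1.0 - global_ssim(pred, target))


if __name__ == "__main__":
    target = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(depth_loss(target, target))                       # perfect prediction
    print(depth_loss([1.0, 0.75, 0.5, 0.25, 0.0], target))  # reversed map
```

The MSE term penalizes per-pixel depth errors, while the (1 - SSIM) term penalizes mismatched overall structure; a practical implementation would typically use a windowed SSIM from an image-processing library instead of this single-window version.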
-
AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement
Authors:
Xinmeng Xu,
Jianjun Hao
Abstract:
Audio-visual speech enhancement is regarded as one of the promising solutions for isolating and enhancing the speech of a desired speaker. Conventional methods focus on predicting the clean speech spectrum via a naive convolutional neural network based encoder-decoder architecture, and these methods a) do not use the data fully and effectively, and b) cannot process features selectively. The proposed model addresses these drawbacks by a) applying a model that fuses audio and visual features layer by layer in the encoding phase and feeds the fused audio-visual features to each corresponding decoder layer, and, more importantly, b) introducing soft threshold attention into the model to select the informative modality softly. This paper proposes an attentional audio-visual multi-layer feature fusion model, in which soft threshold attention units are applied to the feature mapping at every decoder layer. The proposed model demonstrates superior performance compared with state-of-the-art models.
Submitted 26 September, 2022; v1 submitted 15 January, 2021;
originally announced January 2021.
-
Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement
Authors:
Xinmeng Xu,
Jianjun Hao
Abstract:
Speech enhancement can potentially benefit from visual information from the target speaker, such as lip movement and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. In this paper, we address the problem of enhancing corrupted speech signals from videos by using audio-visual (AV) neural processing. Most recent AV speech enhancement approaches process the acoustic and visual features separately and fuse them via a simple concatenation operation. Although this strategy is convenient and easy to implement, it comes with two major drawbacks: 1) evidence in speech perception suggests that in humans AV integration occurs at a very early stage, in contrast to previous models that process the two modalities separately at an early stage and combine them only at a later stage, thus making the system less robust, and 2) a simple concatenation does not allow control over how the information from the acoustic and the visual modalities is treated. To overcome these drawbacks, we propose a multi-layer feature fusion convolution network (MFFCN), which processes the acoustic and visual modalities separately to preserve the features of each modality while fusing the features of both modalities layer by layer in the encoding phase, in line with human AV speech perception. In addition, considering the balance between the two modalities, we design channel and spectral attention mechanisms to provide additional flexibility in dealing with different types of information, expanding the representational ability of the convolutional neural network. Experimental results show that the proposed MFFCN demonstrates performance superior to the state-of-the-art models.
Submitted 23 May, 2022; v1 submitted 15 January, 2021;
originally announced January 2021.
-
SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving
Authors:
Ming Zhou,
Jun Luo,
Julian Villella,
Yaodong Yang,
David Rusu,
Jiayu Miao,
Weinan Zhang,
Montgomery Alban,
Iman Fadakar,
Zheng Chen,
Aurora Chongxi Huang,
Ying Wen,
Kimia Hassanzadeh,
Daniel Graves,
Dong Chen,
Zhengbang Zhu,
Nhat Nguyen,
Mohamed Elsayed,
Kun Shao,
Sanjeevan Ahilan,
Baokuan Zhang,
Jiannan Wu,
Zhengang Fu,
Kasra Rezaee,
Peyman Yadmellat
, et al. (12 additional authors not shown)
Abstract:
Multi-agent interaction is a fundamental aspect of autonomous driving in the real world. Despite more than a decade of research and development, the problem of how to competently interact with diverse road users in diverse scenarios remains largely unsolved. Learning methods have much to offer towards solving this problem, but they require a realistic multi-agent simulator that generates diverse and competent driving interactions. To meet this need, we develop a dedicated simulation platform called SMARTS (Scalable Multi-Agent RL Training School). SMARTS supports the training, accumulation, and use of diverse behavior models of road users. These are in turn used to create increasingly realistic and diverse interactions that enable deeper and broader research on multi-agent interaction. In this paper, we describe the design goals of SMARTS, explain its basic architecture and its key features, and illustrate its use through concrete multi-agent experiments on interactive scenarios. We open-source the SMARTS platform and the associated benchmark tasks and evaluation metrics to encourage and empower research on multi-agent learning for autonomous driving. Our code is available at https://github.com/huawei-noah/SMARTS.
Submitted 31 October, 2020; v1 submitted 19 October, 2020;
originally announced October 2020.
-
Using Empirical Trajectory Data to Design Connected Autonomous Vehicle Controllers for Traffic Stabilization
Authors:
Yujie Li,
Sikai Chen,
Runjia Du,
Paul Young Joun Ha,
Jiqian Dong,
Samuel Labi
Abstract:
Emerging transportation technologies offer unprecedented opportunities to improve the efficiency of the transportation system from the perspectives of energy consumption, congestion, and emissions. One of these technologies is connected and autonomous vehicles (CAVs). With the prospective duality of operations of CAVs and human-driven vehicles (HDVs) in the same roadway space (also referred to as a mixed stream), CAVs are expected to address a variety of traffic problems, particularly those that are either caused or exacerbated by the heterogeneous nature of human driving. In efforts to realize such specific benefits of CAVs in mixed-stream traffic, it is essential to understand and simulate the behavior of human drivers in such environments, and microscopic traffic flow (MTF) models can be used to carry out this task. By helping to comprehend the fundamental dynamics of traffic flow, MTF models serve as a powerful approach to assess the impacts of such flow in terms of safety, stability, and efficiency. In this paper, we seek to calibrate MTF models based on empirical trajectory data as a basis for not only understanding traffic dynamics such as traffic instabilities, but ultimately using CAVs to mitigate stop-and-go wave propagation. The paper therefore duly considers the heterogeneity and uncertainty associated with human driving behavior in order to calibrate the dynamics of each HDV. The paper also designs the CAV controllers based on the microscopic HDV models that are calibrated in real time. The data for the calibration are from the Next Generation SIMulation (NGSIM) trajectory datasets. The results are encouraging, as they indicate the efficacy of the designed controller in significantly improving not only the stability of the mixed traffic stream but also the safety of both CAVs and HDVs in the traffic stream.
Submitted 11 October, 2020;
originally announced October 2020.