-
ShiftLIC: Lightweight Learned Image Compression with Spatial-Channel Shift Operations
Authors:
Youneng Bao,
Wen Tan,
Chuanmin Jia,
Mu Li,
Yongsheng Liang,
Yonghong Tian
Abstract:
Learned Image Compression (LIC) has attracted considerable attention due to its outstanding rate-distortion (R-D) performance and flexibility. However, the substantial computational cost poses challenges for practical deployment. The issue of feature redundancy in LIC is rarely addressed. Our findings indicate that many features within the LIC backbone network exhibit similarities.
This paper introduces ShiftLIC, a novel and efficient LIC framework that employs parameter-free shift operations to replace large-kernel convolutions, significantly reducing the model's computational burden and parameter count. Specifically, we propose the Spatial Shift Block (SSB), which combines shift operations with small-kernel convolutions to replace large-kernel convolutions. This approach maintains feature extraction efficiency while reducing both computational complexity and model size. To further enhance the representation capability in the channel dimension, we propose a channel attention module based on recursive feature fusion. This module enhances feature interaction while minimizing computational overhead. Additionally, we introduce an improved entropy model integrated with the SSB module, making the entropy estimation process more lightweight and thereby comprehensively reducing computational costs.
Experimental results demonstrate that ShiftLIC outperforms leading compression methods, such as VVC Intra and GMM, in terms of computational cost, parameter count, and decoding latency. Additionally, ShiftLIC sets a new SOTA benchmark with a BD-rate gain per MACs/pixel of -102.6\%, showcasing its potential for practical deployment in resource-constrained environments. The code is released at https://github.com/baoyu2020/ShiftLIC.
Submitted 29 March, 2025;
originally announced March 2025.
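The parameter-free spatial shift mentioned in the ShiftLIC abstract can be illustrated with a minimal sketch. It assumes a common formulation in which each channel is moved one pixel in one of four directions with zero padding; the names `shift_2d` and `spatial_shift`, the direction set, and the padding choice are assumptions for illustration, not the paper's Spatial Shift Block.

```python
def shift_2d(plane, dy, dx):
    """Shift a 2-D list by (dy, dx) with zero padding (no weights, no multiplications)."""
    h, w = len(plane), len(plane[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source pixel that lands on (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = plane[sy][sx]
    return out


# Hypothetical direction set: down, up, right, left.
DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def spatial_shift(feature_map):
    """feature_map is a list of channels; channel i uses direction i % 4."""
    return [
        shift_2d(channel, *DIRECTIONS[i % len(DIRECTIONS)])
        for i, channel in enumerate(feature_map)
    ]


fmap = [[[1, 2], [3, 4]] for _ in range(4)]  # four identical 2x2 channels
shifted = spatial_shift(fmap)
```

After such a shift, a small-kernel convolution that mixes channels sees values from neighbouring positions at each pixel, which is the general idea behind using shifts to stand in for a larger spatial kernel.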
-
Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications
Authors:
Marcus Yu Zhe Wee,
Justin Juin Hng Wong,
Lynus Lim,
Joe Yu Wei Tan,
Prannaya Gupta,
Dillion Lim,
En Hao Tew,
Aloysius Keng Siew Han,
Yong Zhi Lim
Abstract:
Effective communication in Air Traffic Control (ATC) is critical to maintaining aviation safety, yet the challenges posed by accented English remain largely unaddressed in Automatic Speech Recognition (ASR) systems. Existing models struggle with transcription accuracy for Southeast Asian-accented (SEA-accented) speech, particularly in noisy ATC environments. This study presents the development of ASR models fine-tuned specifically for Southeast Asian accents using a newly created dataset. Our research achieves significant improvements, achieving a Word Error Rate (WER) of 0.0982 or 9.82% on SEA-accented ATC speech. Additionally, the paper highlights the importance of region-specific datasets and accent-focused training, offering a pathway for deploying ASR systems in resource-constrained military operations. The findings emphasize the need for noise-robust training techniques and region-specific datasets to improve transcription accuracy for non-Western accents in ATC communications.
Submitted 27 February, 2025;
originally announced February 2025.
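The Word Error Rate quoted in the entry above (0.0982, i.e. 9.82%) is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below shows that standard definition; the function name `word_error_rate` and the example phrases are illustrative assumptions and are not taken from the paper's dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # delete a reference word
                cur[j - 1] + 1,                        # insert a hypothesis word
                prev[j - 1] + (ref_word != hyp_word),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


# Made-up phrases: one substituted word out of eight reference words.
wer = word_error_rate(
    "climb and maintain flight level three five zero",
    "climb and maintain flight level three nine zero",
)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.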
-
Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning
Authors:
Bokeng Zheng,
Bo Rao,
Tianxiang Zhu,
Chee Wei Tan,
Jingpu Duan,
Zhi Zhou,
Xu Chen,
Xiaoxi Zhang
Abstract:
Advances in artificial intelligence (AI), including foundation models (FMs), are increasingly transforming human society, with smart cities driving the evolution of urban living. Meanwhile, vehicle crowdsensing (VCS) has emerged as a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities. In particular, ride-hailing vehicles can effectively facilitate flexible data collection and contribute towards urban intelligence, despite resource limitations. Therefore, this work explores a promising scenario, where edge-assisted vehicles perform joint tasks of order serving and the emerging foundation model fine-tuning using various urban data. However, integrating the VCS AI task with the conventional order serving task is challenging, due to their inconsistent spatio-temporal characteristics: (i) the distributions of ride orders and data points of interest (PoIs) may not coincide in geography, both following a priori unknown patterns; (ii) they have distinct forms of temporal effects, i.e., prolonged waiting makes orders become instantly invalid while data with increased staleness gradually reduces its utility for model fine-tuning. To overcome these obstacles, we propose an online framework based on multi-agent reinforcement learning (MARL) with careful augmentation. A new quality-of-service (QoS) metric is designed to characterize and balance the utility of the two joint tasks, under the effects of varying data volumes and staleness. We also integrate graph neural networks (GNNs) with MARL to enhance state representations, capturing graph-structured, time-varying dependencies among vehicles and across locations. Extensive experiments on our testbed simulator, utilizing various real-world foundation model fine-tuning tasks and the New York City Taxi ride order dataset, demonstrate the advantage of our proposed method.
Submitted 6 February, 2025;
originally announced February 2025.
-
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
Authors:
Sida Tian,
Can Zhang,
Wei Yuan,
Wei Tan,
Wenjie Zhu
Abstract:
In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine Highlights of Collectibles at WAIC 2023. The project homepage of XMusic is https://xmusic-project.github.io.
Submitted 15 January, 2025;
originally announced January 2025.
-
MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization
Authors:
Haina Zhu,
Yizhi Zhou,
Hangting Chen,
Jianwei Yu,
Ziyang Ma,
Rongzhi Gu,
Yi Luo,
Wei Tan,
Xie Chen
Abstract:
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in https://github.com/tencent-ailab/MuQ.
Submitted 3 January, 2025; v1 submitted 2 January, 2025;
originally announced January 2025.
-
SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor
Authors:
Chenyu Yang,
Shuai Wang,
Hangting Chen,
Jianwei Yu,
Wei Tan,
Rongzhi Gu,
Yaoxun Xu,
Yizhi Zhou,
Haina Zhu,
Haizhou Li
Abstract:
The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models are capable of synthesizing both vocals and accompaniment tracks up to several minutes long concurrently, research on partial adjustment or editing of existing songs, which would allow for more flexible and effective production, remains underexplored. In this paper, we present SongEditor, the first song editing paradigm that introduces editing capabilities into language-modeling song generation approaches, facilitating both segment-wise and track-wise modifications. SongEditor offers the flexibility to adjust lyrics, vocals, and accompaniments, as well as to synthesize songs from scratch. The core components of SongEditor include a music tokenizer, an autoregressive language model, and a diffusion generator, enabling the generation of an entire section, masked lyrics, or even separated vocals and background music. Extensive experiments demonstrate that the proposed SongEditor achieves exceptional performance in end-to-end song editing, as evidenced by both objective and subjective metrics. Audio samples are available at https://cypress-yang.github.io/SongEditor_demo/.
Submitted 28 January, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
-
SSR: Alignment-Aware Modality Connector for Speech Language Models
Authors:
Weiting Tan,
Hirofumi Inaguma,
Ning Dong,
Paden Tomasello,
Xutai Ma
Abstract:
Fusing speech into a pre-trained language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text modality. We propose SSR-Connector (Segmented Speech Representation Connector) for better modality fusion. Leveraging speech-text alignments, our approach segments and compresses speech features to match the granularity of text embeddings. Additionally, we introduce a two-stage training pipeline that includes the distillation and fine-tuning phases to mitigate catastrophic forgetting. SSR-Connector outperforms existing mechanisms for speech-text modality fusion, consistently achieving better speech understanding (e.g., +10 accuracy on StoryCloze and +20 on Speech-MMLU) while preserving pre-trained text ability.
Submitted 17 May, 2025; v1 submitted 30 September, 2024;
originally announced October 2024.
-
MuCodec: Ultra Low-Bitrate Music Codec
Authors:
Yaoxun Xu,
Hangting Chen,
Jianwei Yu,
Wei Tan,
Rongzhi Gu,
Shun Lei,
Zhiwei Lin,
Zhiyong Wu
Abstract:
Music codecs are a vital aspect of audio codec research, and ultra low-bitrate compression holds significant importance for music transmission and generation. Due to the complexity of music backgrounds and the richness of vocals, solely relying on modeling semantic or acoustic information cannot effectively reconstruct music with both vocals and backgrounds. To address this issue, we propose MuCodec, specifically targeting music compression and reconstruction tasks at ultra low bitrates. MuCodec employs MuEncoder to extract both acoustic and semantic features, discretizes them with RVQ, and obtains Mel-VAE features via flow-matching. The music is then reconstructed using a pre-trained MEL-VAE decoder and HiFi-GAN. MuCodec can reconstruct high-fidelity music at ultra low (0.35kbps) or high bitrates (1.35kbps), achieving the best results to date in both subjective and objective metrics. Code and Demo: https://xuyaoxun.github.io/MuCodec_demo/.
Submitted 28 September, 2024; v1 submitted 20 September, 2024;
originally announced September 2024.
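The MuCodec abstract above says features are discretized with RVQ (residual vector quantization). A minimal sketch of generic RVQ follows: each stage picks the nearest codeword for whatever residual the previous stages left behind. The tiny hand-written codebooks and the names `rvq_encode` and `rvq_decode` are assumptions for illustration; real codecs learn their codebooks, and this is not MuCodec's implementation.

```python
def nearest(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; returns one index per stage and the final residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks: a coarse stage followed by a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [-0.5, 0.0]],
]
indices, residual = rvq_encode([1.4, 1.0], codebooks)
reconstruction = rvq_decode(indices, codebooks)
```

Only the per-stage indices need to be stored or transmitted, so the bitrate grows with the number of stages and the codebook sizes.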
-
Analysis of Dead Reckoning Accuracy in Swarm Robotics System
Authors:
Weihang Tan,
Timothy Anglea,
Yongqiang Wang
Abstract:
The objective of this paper is to determine the position of a single mobile robot in a swarm using dead reckoning techniques. We investigate the accuracy of navigation by using this process. The paper begins with the research background and social importance. Then, the specific experimental setup and analysis of experimental results are presented. Finally, the results are detailed and some potential improvements are provided.
Submitted 5 July, 2024;
originally announced July 2024.
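Dead reckoning, the technique studied in the entry above, estimates position by integrating measured speed and heading over time. The sketch below assumes a simple planar unicycle model with Euler integration; the function name `dead_reckon` and the step format are illustrative assumptions, since the abstract does not describe the robots' kinematics.

```python
import math


def dead_reckon(pose, steps):
    """Integrate (speed, turn_rate, dt) steps from pose = (x, y, heading in radians)."""
    x, y, theta = pose
    for speed, turn_rate, dt in steps:
        x += speed * math.cos(theta) * dt  # advance along the current heading
        y += speed * math.sin(theta) * dt
        theta += turn_rate * dt            # then update the heading
    return x, y, theta


# Drive 2 s forward at 1 m/s, turn 90 degrees in place, drive 3 s forward.
x, y, theta = dead_reckon(
    (0.0, 0.0, 0.0),
    [(1.0, 0.0, 2.0), (0.0, math.pi / 2, 1.0), (1.0, 0.0, 3.0)],
)
```

Since each step builds on the previous estimate with no external correction, sensor errors accumulate over time, which is why the accuracy of the method is worth measuring.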
-
Enhancing Adversarial Training with Prior Knowledge Distillation for Robust Image Compression
Authors:
Zhi Cao,
Youneng Bao,
Fanyang Meng,
Chao Li,
Wen Tan,
Genhong Wang,
Yongsheng Liang
Abstract:
Deep neural network-based image compression (NIC) has achieved excellent performance, but NIC models have been shown to be susceptible to backdoor attacks. Adversarial training has been validated in image compression models as a common method to enhance model robustness. However, the improvement in robustness that adversarial training provides is limited. In this paper, we propose a prior knowledge-guided adversarial training framework for image compression models. Specifically, we first propose a gradient regularization constraint for training robust teacher models. Subsequently, we design a knowledge distillation based strategy to transfer prior knowledge from the teacher model to the student model for guiding adversarial training. Experimental results show that our method improves the reconstruction quality by about 9dB when the Kodak dataset is selected as the backdoor attack object for the PSNR attack. Compared with Ma2023, our method has a 5dB higher PSNR output at high bitrate points.
Submitted 15 March, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
Variational Bayesian Learning based Joint Localization and Path Loss Exponent with Distance-dependent Noise in Wireless Sensor Network
Authors:
Yunfei Li,
Yiting Luo,
Weiqiang Tan,
Chunguo Li,
Shaodan Ma,
Guanghua Yang
Abstract:
This paper focuses on the challenge of jointly optimizing location and path loss exponent (PLE) in distance-dependent noise. Departing from the conventional independent noise model used in localization and path loss exponent estimation problems, we consider a more realistic model incorporating distance-dependent noise variance, as revealed in recent theoretical analyses and experimental results. The distance-dependent noise introduces a complex noise model with unknown noise power and PLE, resulting in an exceptionally challenging non-convex and nonlinear optimization problem. In this study, we address a joint localization and path loss exponent estimation problem encompassing distance-dependent noise, unknown parameters, and uncertainties in sensor node locations. To surmount the intractable nonlinear and non-convex objective function inherent in the problem, we introduce a variational Bayesian learning-based framework that enables the joint optimization of localization, path loss exponent, and reference noise parameters by leveraging an effective approximation to the true posterior distribution. Furthermore, the proposed joint learning algorithm provides an iterative closed-form solution and exhibits superior performance in terms of computational complexity compared to existing algorithms. Computer simulation results demonstrate that the proposed algorithm approaches the performance of the Bayesian Cramer-Rao bound (BCRB), achieves localization performance comparable to the (maximum likelihood-Gaussian message passing) ML-GMP algorithm in some cases, and outperforms the other comparison algorithm in all cases.
Submitted 20 July, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
-
Streaming Sequence Transduction through Dynamic Compression
Authors:
Weiting Tan,
Yunmo Chen,
Tongfei Chen,
Guanghui Qin,
Haoran Xu,
Heidi C. Zhang,
Benjamin Van Durme,
Philipp Koehn
Abstract:
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
Submitted 21 May, 2025; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Multi-Modality Deep Network for Extreme Learned Image Compression
Authors:
Xuhao Jiang,
Weimin Tan,
Tian Tan,
Bo Yan,
Liquan Shen
Abstract:
Image-based single-modality compression learning approaches have demonstrated exceptionally powerful encoding and decoding capabilities in the past few years, but suffer from blur and severe semantics loss at extremely low bitrates. To address this issue, we propose a multimodal machine learning method for text-guided image compression, in which the semantic information of text is used as prior information to guide image compression for better compression performance. We fully study the role of text description in different components of the codec, and demonstrate its effectiveness. In addition, we adopt the image-text attention module and image-request complement module to better fuse image and text features, and propose an improved multimodal semantic-consistent loss to produce semantically complete reconstructions. Extensive experiments, including a user study, prove that our method can obtain visually pleasing results at extremely low bitrates, and achieves a comparable or even better performance than state-of-the-art methods, even though these methods are at 2x to 4x bitrates of ours.
Submitted 26 April, 2023;
originally announced April 2023.
-
Exploring Adversarial Attacks on Neural Networks: An Explainable Approach
Authors:
Justus Renkhoff,
Wenkai Tan,
Alvaro Velasquez,
William Yichen Wang,
Yongxin Liu,
Jian Wang,
Shuteng Niu,
Lejla Begic Fazlic,
Guido Dartmann,
Houbing Song
Abstract:
Deep Learning (DL) is being applied in various domains, especially in safety-critical applications such as autonomous driving. Consequently, it is of great significance to ensure the robustness of these methods and thus counteract uncertain behaviors caused by adversarial attacks. In this paper, we use gradient heatmaps to analyze the response characteristics of the VGG-16 model when the input images are mixed with adversarial noise and statistically similar Gaussian random noise. In particular, we compare the network response layer by layer to determine where errors occurred. Several interesting findings are derived. First, compared to Gaussian random noise, intentionally generated adversarial noise causes severe behavior deviation by distracting the area of concentration in the networks. Second, in many cases, adversarial examples only need to compromise a few intermediate blocks to mislead the final decision. Third, our experiments revealed that specific blocks are more vulnerable and easier to exploit by adversarial examples. Finally, we demonstrate that the layers $Block4\_conv1$ and $Block5\_conv1$ of the VGG-16 model are more susceptible to adversarial attacks. Our work could provide valuable insights into developing more reliable Deep Neural Network (DNN) models.
Submitted 8 March, 2023;
originally announced March 2023.
-
Automated Sex Classification of Children's Voices and Changes in Differentiating Factors with Age
Authors:
Fuling Chen,
Roberto Togneri,
Murray Maybery,
Diana Weiting Tan
Abstract:
Sex classification of children's voices allows for an investigation of the development of secondary sex characteristics which has been a key interest in the field of speech analysis. This research investigated a broad range of acoustic features from scripted and spontaneous speech and applied a hierarchical clustering-based machine learning model to distinguish the sex of children aged between 5 and 15 years. We proposed an optimal feature set and our modelling achieved an average F1 score (the harmonic mean of the precision and recall) of 0.84 across all ages. Our results suggest that the sex classification is generally more accurate when a model is developed for each year group rather than for children in 4-year age bands, with classification accuracy being better for older age groups. We found that spontaneous speech could provide more helpful cues in sex classification than scripted speech, especially for children younger than 7 years. For younger age groups, a broad range of acoustic factors contributed evenly to sex classification, while for older age groups, F0-related acoustic factors were found to be the most critical predictors generally. Other important acoustic factors for older age groups include vocal tract length estimators, spectral flux, loudness and unvoiced features.
Submitted 26 September, 2022;
originally announced September 2022.
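The F1 score reported in the entry above is described there as the harmonic mean of precision and recall. A minimal sketch of that definition follows; the function name `f1_score` and the confusion counts are made up for illustration and are not the study's data.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_pos == 0:
        return 0.0  # no correct positive predictions, so F1 is defined as 0 here
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 42 correct detections, 8 false alarms, 8 misses.
score = f1_score(true_pos=42, false_pos=8, false_neg=8)
```

The harmonic mean is pulled toward the smaller of the two values, so a classifier cannot reach a high F1 by maximizing precision or recall alone.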
-
Perception-Oriented Stereo Image Super-Resolution
Authors:
Chenxi Ma,
Bo Yan,
Weimin Tan,
Xuhao Jiang
Abstract:
Recent studies of deep learning based stereo image super-resolution (StereoSR) have promoted the development of StereoSR. However, existing StereoSR models mainly concentrate on improving quantitative evaluation metrics and neglect the visual quality of super-resolved stereo images. To improve the perceptual performance, this paper proposes the first perception-oriented stereo image super-resolution approach by exploiting the feedback, provided by the evaluation on the perceptual quality of StereoSR results. To provide accurate guidance for the StereoSR model, we develop the first special stereo image super-resolution quality assessment (StereoSRQA) model, and further construct a StereoSRQA database. Extensive experiments demonstrate that our StereoSR approach significantly improves the perceptual quality and enhances the reliability of stereo images for disparity estimation.
Submitted 13 July, 2022;
originally announced July 2022.
-
Two-Stage COVID19 Classification Using BERT Features
Authors:
Weijun Tan,
Qi Yao,
Jingfeng Liu
Abstract:
We propose an automatic COVID-19 diagnosis framework from lung CT-scan slice images using double BERT feature extraction. In the first BERT feature extraction, a 3D-CNN is first used to extract CNN internal feature maps. Instead of using global average pooling, a late BERT temporal pooling is used to aggregate the temporal information in these feature maps, followed by a classification layer. This 3D-CNN-BERT classification network is first trained on a fixed number of slice images sampled from every original CT scan volume. In the second stage, the 3D-CNN-BERT embedding features are extracted on all slice images of every CT scan volume, and these features are averaged into a fixed number of segments. Then another BERT network is used to aggregate these multiple features into a single feature followed by another classification layer. The classification results of both stages are combined to generate final outputs. On the validation dataset, we achieve a macro F1 score of 0.9164.
Submitted 29 June, 2022;
originally announced June 2022.
-
Universal Learned Image Compression With Low Computational Cost
Authors:
Bowen Li,
Yao Xin,
Youneng Bao,
Fanyang Meng,
Yongsheng Liang,
Wen Tan
Abstract:
Recently, learned image compression methods have developed rapidly and exhibited excellent rate-distortion performance when compared to traditional standards, such as JPEG, JPEG2000 and BPG. However, the learning-based methods suffer from high computational costs, which is not beneficial for deployment on devices with limited resources. To this end, we propose shift-addition parallel modules (SAPMs), including SAPM-E for the encoder and SAPM-D for the decoder, to largely reduce the energy consumption. To be specific, they can be taken as plug-and-play components to upgrade existing CNN-based architectures, where the shift branch is used to extract large-grained features as compared to small-grained features learned by the addition branch. Furthermore, we thoroughly analyze the probability distribution of latent representations and propose to use Laplace Mixture Likelihoods for more accurate entropy estimation. Experimental results demonstrate that the proposed methods can achieve comparable or even better performance on both PSNR and MS-SSIM metrics to that of the convolutional counterpart with an about 2x energy reduction.
Submitted 23 June, 2022;
originally announced June 2022.
-
Modelling Pre-fatigue, Low-velocity Impact and Fatigue behaviours of Composite Helicopter Tail Structures under Multipoint Coordinated Loading Spectrum
Authors:
Zheng-Qiang Cheng,
Wei Tan,
Jun-Jiang Xiong
Abstract:
This paper aims to numerically study the pre-fatigue, low-velocity impact (LVI) and fatigue progressive damage behaviours of a full-scale composite helicopter tail structure under a multipoint coordinated loading spectrum. First, a fatigue progressive damage model (PDM) incorporating a multiaxial fatigue residual strength degradation rule, fatigue failure criteria based on the fatigue residual strength concept and a sudden stiffness degradation rule was proposed. Then, an LVI progressive damage model for plain-weave (PW) and unidirectional (UD) composites was developed. Moreover, a full-process analysis algorithm with a reasonable damage transfer strategy for pre-fatigue, LVI and fatigue progressive damage analysis was proposed. Finally, a computationally efficient and accurate full-scale global-local finite element (FE) model of the helicopter tail structure was built to predict strain distribution under two flight working conditions, to predict LVI damage under impact loading, and to assess fatigue damage behaviours under the multipoint coordinated loading spectrum. The numerical predictions agree well with test results from this work and literature data, indicating that the developed pre-fatigue, LVI and fatigue PDMs and algorithms, as well as the global-local FE modelling based on shell-to-solid coupling, can effectively analyse the impact damage tolerance of full-scale aircraft structures.
Submitted 5 May, 2022;
originally announced May 2022.
-
Transformations in Learned Image Compression from a Modulation Perspective
Authors:
Youneng Bao,
Fangyang Meng,
Wen Tan,
Chao Li,
Yonghong Tian,
Yongsheng Liang
Abstract:
In this paper, a unified transformation method in learned image compression (LIC) is proposed from the perspective of modulation. Firstly, the quantization in LIC is considered as a generalized channel with additive uniform noise. Moreover, LIC is interpreted as a particular communication system according to the consistency in structures and optimization objectives. Thus, the technology of communication systems can be applied to guide the design of modules in LIC. Furthermore, a unified transform method based on signal modulation (TSM) is defined. In the view of TSM, the existing transformation methods are mathematically reduced to a linear modulation. A series of transformation methods, e.g. TPM and TJM, are obtained by extending to nonlinear modulation. The experimental results on various datasets and backbone architectures verify the effectiveness and robustness of the proposed method. More importantly, they further confirm the feasibility of guiding LIC design from a communication perspective. For example, when the backbone architecture is a hyperprior combined with a context model, our method achieves a 3.52% BD-rate reduction over GDN on the Kodak dataset without increasing complexity.
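The view of quantization as a channel with additive uniform noise can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes unit-step rounding and Gaussian latents with an arbitrary scale (both illustrative choices), and compares the variance of the rounding error with that of noise drawn from U(-0.5, 0.5), which is 1/12 in theory.

```python
import random
import statistics


def quantization_errors(values):
    """Error introduced by hard rounding to the nearest integer: round(y) - y."""
    return [round(v) - v for v in values]


def uniform_noise(n, rng):
    """n samples from U(-0.5, 0.5), the additive-noise stand-in for rounding."""
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


# Assumption: latents are Gaussian with a scale much larger than the step size.
rng = random.Random(0)
latents = [rng.gauss(0.0, 10.0) for _ in range(100_000)]

err = quantization_errors(latents)
noise = uniform_noise(len(latents), rng)

# Both variances should be close to the theoretical value 1/12.
var_err = statistics.pvariance(err)
var_noise = statistics.pvariance(noise)
print(f"rounding error variance: {var_err:.4f}")
print(f"uniform noise variance:  {var_noise:.4f}")
print(f"theoretical 1/12:        {1 / 12:.4f}")
```

The match degrades when the latent scale is small relative to the quantization step, which is one reason the additive-noise model is an approximation rather than an identity.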
Submitted 12 March, 2024; v1 submitted 4 March, 2022;
originally announced March 2022.
-
Exploring Structural Sparsity in Neural Image Compression
Authors:
Shanzhi Yin,
Chao Li,
Wen Tan,
Youneng Bao,
Yongsheng Liang,
Wei Liu
Abstract:
Neural image compression has matched or outperformed traditional methods (such as JPEG, BPG, WebP). However, the sophisticated network structures with cascaded convolution layers impose a heavy computational burden for practical deployment. In this paper, we explore structural sparsity in neural image compression networks to obtain real-time acceleration without any specialized hardware design or algorithm. We propose a simple plug-in adaptive binary channel masking (ABCM) to judge the importance of each convolution channel and introduce sparsity during training. During inference, the unimportant channels are pruned to obtain a slimmer network and less computation. We implement our method in three neural image compression networks with different entropy models to verify its effectiveness and generalization. The experimental results show that up to 7x computation reduction and 3x acceleration can be achieved with a negligible performance drop.
Submitted 11 March, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
-
Data Augmentation and CNN Classification For Automatic COVID-19 Diagnosis From CT-Scan Images On Small Dataset
Authors:
Weijun Tan,
Hongwei Guo
Abstract:
We present an automatic COVID-19 diagnosis framework from lung CT images. The focus is on signal processing and classification on small datasets, with effort put into exploring data preparation and augmentation to improve the generalization capability of the 2D CNN classification models. We propose a unique and effective data augmentation method using multiple Hounsfield Unit (HU) normalization windows. In addition, the original slice image is cropped to exclude background, and a filter is applied to filter out closed-lung images. For the classification network, we choose to use 2D DenseNet and Xception with the feature pyramid network (FPN). To further improve the classification accuracy, an ensemble of multiple CNN models and HU windows is used. On the training/validation dataset, we achieve a patient classification accuracy of 93.39%.
Submitted 30 September, 2021; v1 submitted 16 August, 2021;
originally announced August 2021.
-
A 3D CNN Network with BERT For Automatic COVID-19 Diagnosis From CT-Scan Images
Authors:
Weijun Tan,
Jingfeng Liu
Abstract:
We present an automatic COVID-19 diagnosis framework from lung CT-scan slice images. In this framework, the slice images of a CT-scan volume are first preprocessed using segmentation techniques to filter out images of closed lung and to remove the useless background. Then a resampling method is used to select one or multiple sets of a fixed number of slice images for training and validation. A 3D CNN network with BERT is used to classify this set of selected slice images. In this network, an embedding feature is also extracted. In cases where there is more than one set of slice images in a volume, the features of all sets are extracted and pooled into a global feature vector for the whole CT-scan volume. A simple multilayer perceptron (MLP) network is used to further classify the aggregated feature vector. The models are trained and evaluated on the provided training and validation datasets. On the validation dataset, the accuracy is 0.9278 and the F1 score is 0.9261.
Submitted 4 October, 2021; v1 submitted 28 June, 2021;
originally announced June 2021.
-
DeepWiPHY: Deep Learning-based Receiver Design and Dataset for IEEE 802.11ax Systems
Authors:
Yi Zhang,
Akash Doshi,
Rob Liston,
Wai-tian Tan,
Xiaoqing Zhu,
Jeffrey G. Andrews,
Robert W. Heath
Abstract:
In this work, we develop DeepWiPHY, a deep learning-based architecture to replace the channel estimation, common phase error (CPE) correction, sampling rate offset (SRO) correction, and equalization modules of IEEE 802.11ax-based orthogonal frequency division multiplexing (OFDM) receivers. We first train DeepWiPHY with a synthetic dataset, which is generated using representative indoor channel models and includes typical radio frequency (RF) impairments that are the source of nonlinearity in wireless systems. To further train and evaluate DeepWiPHY with real-world data, we develop a passive sniffing-based data collection testbed composed of Universal Software Radio Peripherals (USRPs) and commercially available IEEE 802.11ax products. The comprehensive evaluation of DeepWiPHY with synthetic and real-world datasets (110 million synthetic OFDM symbols and 14 million real-world OFDM symbols) confirms that, even without fine-tuning the neural network's architecture parameters, DeepWiPHY matches or outperforms conventional WLAN receivers in terms of both bit error rate (BER) and packet error rate (PER), under a wide range of channel models, signal-to-noise ratio (SNR) levels, and modulation schemes.
Submitted 19 October, 2020;
originally announced October 2020.
-
Adaptive Energy Management for Real Driving Conditions via Transfer Reinforcement Learning
Authors:
Teng Liu,
Wenhao Tan,
Xiaolin Tang,
Jiaxin Chen,
Dongpu Cao
Abstract:
This article proposes a transfer reinforcement learning (RL) based adaptive energy management approach for a hybrid electric vehicle (HEV) with parallel topology. This approach is bi-level. The upper level characterizes how to transform the Q-value tables in the RL framework via driving cycle transformation (DCT). Specifically, transition probability matrices (TPMs) of power request are computed for different cycles, and the induced matrix norm (IMN) is employed as a critical criterion to identify the transformation differences and to determine the alteration of the control strategy. The lower level determines how to set the corresponding control strategies with the transformed Q-value tables and TPMs using a model-free RL algorithm. Numerical tests illustrate that the transferred performance can be tuned by the IMN value and that the transfer RL controller can achieve higher fuel economy. The comparison demonstrates that the proposed strategy exceeds the conventional RL approach in both calculation speed and control performance.
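The TPM and IMN computation described above can be illustrated with a small sketch. This is not the authors' implementation: the discretization of power request into three levels, the two toy driving cycles, and the choice of the induced infinity norm (maximum absolute row sum) are all assumptions made here for illustration, since the abstract does not state which induced norm is used.

```python
def transition_matrix(states, n_states):
    """Estimate a row-stochastic TPM from a sequence of discrete state indices."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    tpm = []
    for row in counts:
        total = sum(row)
        # States never visited as a source get a uniform row so rows sum to 1.
        tpm.append([c / total if total else 1.0 / n_states for c in row])
    return tpm


def induced_inf_norm_diff(p, q):
    """Induced infinity norm of (p - q): the maximum absolute row sum."""
    return max(
        sum(abs(x - y) for x, y in zip(row_p, row_q))
        for row_p, row_q in zip(p, q)
    )


# Hypothetical power-request levels (0 = low, 1 = medium, 2 = high) for two cycles.
cycle_a = [0, 0, 1, 1, 2, 1, 0, 0, 1, 2, 2, 1]
cycle_b = [0, 1, 2, 2, 2, 1, 2, 2, 1, 0, 1, 2]

tpm_a = transition_matrix(cycle_a, 3)
tpm_b = transition_matrix(cycle_b, 3)
gap = induced_inf_norm_diff(tpm_a, tpm_b)
print(f"IMN between the two cycles: {gap:.3f}")
```

A larger gap indicates that the two cycles have more dissimilar power-request dynamics; how that value maps to a change of control strategy is specific to the paper and is not reproduced here.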
Submitted 24 July, 2020;
originally announced July 2020.
-
Driving Conditions-Driven Energy Management for Hybrid Electric Vehicles: A Review
Authors:
Teng Liu,
Wenhao Tan,
Xiaolin Tang,
Jinwei Zhang,
Yang Xing,
Dongpu Cao
Abstract:
Motivated by concerns about transportation fuel consumption and global air pollution, industrial engineers and academic researchers have made many efforts to construct more efficient and environmentally friendly vehicles. Hybrid electric vehicles (HEVs) are representative examples because they can satisfy the power demand by coordinating the energy supply among different energy storage devices. To achieve this goal, energy management approaches are a crucial technology, and driving cycles are the critical influencing factor. Therefore, this paper aims to summarize driving cycle-driven energy management strategies (EMSs) for HEVs. First, the definition and significance of driving cycles in the energy management field are clarified, and the recent literature in this research domain is reviewed. In addition, according to the known information of driving cycles, the EMSs are divided into three categories, and the relevant study directions, such as standard driving cycles, long-term driving cycle generation (LT-DCG) and short-term driving cycle prediction (ST-DCP), are described and analyzed. Furthermore, the existing databases of driving cycles for highway and urban conditions are presented and discussed. Finally, this article also elaborates on the future prospects of energy management technologies related to driving cycles. This paper aims to help researchers understand the state of the art in the HEV energy management field and recognize its future development direction.
Submitted 1 August, 2020; v1 submitted 16 July, 2020;
originally announced July 2020.
-
Transfer Deep Reinforcement Learning-enabled Energy Management Strategy for Hybrid Tracked Vehicle
Authors:
Xiaowei Guo,
Teng Liu,
Bangbei Tang,
Xiaolin Tang,
Jinwei Zhang,
Wenhao Tan,
Shufeng Jin
Abstract:
This paper proposes an adaptive energy management strategy for hybrid electric vehicles by combining deep reinforcement learning (DRL) and transfer learning (TL). This work aims to address the lengthy training time of DRL. First, an optimization control model of a hybrid tracked vehicle is built, wherein the detailed powertrain components are introduced. Then, a bi-level control framework is constructed to derive the energy management strategies (EMSs). The upper level applies deep deterministic policy gradient (DDPG) algorithms for EMS training at different speed intervals. The lower level employs the TL method to transform the pre-trained neural networks for a novel driving cycle. Finally, a series of experiments is executed to demonstrate the effectiveness of the presented control framework. The optimality and adaptability of the formulated EMS are illustrated. The resulting DRL and TL-enabled control policy is capable of enhancing energy efficiency and improving system performance.
Submitted 16 July, 2020;
originally announced July 2020.
-
Transferred Energy Management Strategies for Hybrid Electric Vehicles Based on Driving Conditions Recognition
Authors:
Teng Liu,
Xiaolin Tang,
Jiaxin Chen,
Hong Wang,
Wenhao Tan,
Yalian Yang
Abstract:
Energy management strategies (EMSs) are the most significant components in hybrid electric vehicles (HEVs) because they determine the potential for energy conservation and emission reduction. This work presents a transferred EMS for a parallel HEV that combines the reinforcement learning method with driving conditions recognition. First, the Markov decision process (MDP) and the transition probability matrix are utilized to differentiate the driving conditions. Then, reinforcement learning algorithms are formulated to achieve power split controls, in which Q-tables are tuned by current driving situations. Finally, the proposed transferred framework is evaluated and validated in a parallel hybrid topology. Its advantages in computational efficiency and fuel economy are summarized and demonstrated.
Submitted 16 July, 2020;
originally announced July 2020.
-
Detecting Driver's Distraction using Long-term Recurrent Convolutional Network
Authors:
Chang Wei Tan,
Mahsa Salehi,
Geoffrey Mackellar
Abstract:
In this study we demonstrate a novel Brain Computer Interface (BCI) approach to detect driver distraction events to improve road safety. We use a commercial wireless headset that generates EEG signals from the brain. We collected real EEG signals from participants who undertook a 40-minute driving simulation and were required to perform different tasks while driving. These signals are segmented into short windows and labelled using a time series classification (TSC) model. We studied different TSC approaches and designed a Long-term Recurrent Convolutional Network (LRCN) model for this task. Our results showed that our LRCN model performs better than the state-of-the-art TSC models at detecting driver distraction events.
Submitted 13 April, 2020;
originally announced April 2020.
-
Deep Vocoder: Low Bit Rate Compression of Speech with Deep Autoencoder
Authors:
Gang Min,
Changqing Zhang,
Xiongwei Zhang,
Wei Tan
Abstract:
Inspired by the success of deep neural networks (DNNs) in speech processing, this paper presents Deep Vocoder, a direct end-to-end low bit rate speech compression method with a deep autoencoder (DAE). In Deep Vocoder, the DAE is used to extract the latent representing features (LRFs) of speech, which are then efficiently quantized by an analysis-by-synthesis vector quantization (AbS VQ) method. AbS VQ aims to minimize the perceptual spectral reconstruction distortion rather than the distortion of the LRF vector itself. Also, a suboptimal codebook searching technique is proposed to further reduce the computational complexity. Experimental results demonstrate that Deep Vocoder yields substantial improvements in terms of frequency-weighted segmental SNR, STOI and PESQ score when compared to the output of the conventional SQ- or VQ-based codec. The resulting PESQ scores over the TIMIT corpus are 3.34 and 3.08 for speech coding at 2400 bit/s and 1200 bit/s, respectively.
Submitted 14 May, 2019; v1 submitted 12 May, 2019;
originally announced May 2019.
-
Tablet-based Information System for Commercial Aircraft: Onboard Context-Sensitive Information System (OCSIS)
Authors:
Guy Andre Boy,
Wei Tan
Abstract:
Pilots currently use paper-based documentation and electronic systems to help them perform procedures to ensure safety, efficiency and comfort on commercial aircraft. Management of interconnections among paper-based operational documents can be a challenge for pilots, especially when time pressure is high in normal, abnormal, and emergency situations. This dissertation is a contribution to the design of an Onboard Context-Sensitive Information System (OCSIS), which was developed on a tablet. The claim is that the use of contextual information facilitates access to appropriate operational content at the right time, either automatically or on demand. OCSIS was tested using human-in-the-loop simulations that involved professional pilots in the Airbus 320 cockpit simulator. First results are encouraging and show that OCSIS can be usable and useful for operational information access. More specifically, context-sensitivity helps simplify this access (i.e., appropriate operational information is provided at the right time in the right format). In addition, OCSIS provides other features that paper-based documents do not have, such as procedure execution status after an interruption. Also, the fact that several calculations are automatically done by OCSIS tends to decrease the pilot's task demand.
Submitted 21 November, 2018;
originally announced November 2018.
-
Impedance control of a cable-driven series elastic actuator with the 2-DOF control structure
Authors:
Wulin Zou,
Zhuo Yang,
Wen Tan,
Meng Wang,
Jingtai Liu,
Ningbo Yu
Abstract:
Series elastic actuators (SEAs) are increasingly important in physical human-robot interaction (HRI) due to their inherent safety and compliance. Cable-driven SEAs also allow flexible installation and remote torque transmission. However, there are still challenges for the impedance control of cable-driven SEAs, such as the reduced bandwidth caused by the elastic component, and the performance balance between reference tracking and robustness. In this paper, a velocity-sourced cable-driven SEA has been set up. Then, a stabilizing 2 degrees of freedom (2-DOF) control approach was designed to separately pursue the goals of robustness and torque tracking. Further, the impedance control structure for human-robot interaction was designed and implemented with a torque compensator. Both simulation and practical experiments have validated the efficacy of the 2-DOF method for the control of cable-driven SEAs.
Submitted 30 October, 2016;
originally announced October 2016.