-
ICME 2025 Generalizable HDR and SDR Video Quality Measurement Grand Challenge
Authors:
Yixu Chen,
Bowen Chen,
Hai Wei,
Alan C. Bovik,
Baojun Li,
Wei Sun,
Linhan Cao,
Kang Fu,
Dandan Zhu,
Jun Jia,
Menghan Hu,
Xiongkuo Min,
Guangtao Zhai,
Dounia Hammou,
Fei Yin,
Rafal Mantiuk,
Amritha Premkumar,
Prajit T Rajendran,
Vignesh V Menon
Abstract:
This paper reports on the IEEE International Conference on Multimedia & Expo (ICME) 2025 Grand Challenge on Generalizable HDR and SDR Video Quality Measurement. With the rapid development of video technology, especially High Dynamic Range (HDR) and Standard Dynamic Range (SDR) content, the need for robust and generalizable Video Quality Assessment (VQA) methods has grown. Existing VQA models often struggle to deliver consistent performance across varying dynamic ranges, distortion types, and diverse content. This challenge was established to benchmark and promote VQA approaches capable of jointly handling HDR and SDR content. In the final evaluation phase, five teams submitted seven models along with technical reports to the Full Reference (FR) and No Reference (NR) tracks. Among them, four methods outperformed the VMAF baseline, while the top-performing model achieved state-of-the-art performance, setting a new benchmark for generalizable video quality assessment.
Submitted 28 June, 2025;
originally announced June 2025.
-
Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning
Authors:
Jinsun Yoo,
ChonLam Lao,
Lianjie Cao,
Bob Lantz,
Minlan Yu,
Tushar Krishna,
Puneet Sharma
Abstract:
This paper lays the foundation for Genie, a testing framework that captures the impact of real hardware network behavior on ML workload performance, without requiring expensive GPUs. Genie uses CPU-initiated traffic over a hardware testbed to emulate GPU to GPU communication, and adapts the ASTRA-sim simulator to model interaction between the network and the ML workload.
Submitted 29 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Bingchen Li,
Fengbin Guan,
Yizhen Shao,
Zihao Yu,
Xijun Wang,
Yiting Lu,
Wei Luo,
Suhang Yao,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Yabin Zhang,
Ao-Xiang Zhang,
Tianwu Zhi,
Jianzhao Liu,
Yang Li,
Jingwen Xu,
Yiting Liao,
Yushen Zuo,
Mingyang Wu,
Renjie Li,
Shengyun Zhong
, et al. (88 additional authors not shown)
Abstract:
This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components used in previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE- ChallengeCVPR-NTIRE2025.
Submitted 17 April, 2025;
originally announced April 2025.
-
STF-GCN: A Multi-Domain Graph Convolution Network Method for Automatic Modulation Recognition via Adaptive Correlation
Authors:
Mingyuan Shao,
Zhengqiu Fu,
Dingzhao Li,
Fuqing Zhang,
Yilin Cai,
Shaohua Hong,
Lin Cao,
Yuan Peng,
Jie Qi
Abstract:
Automatic Modulation Recognition (AMR) is an essential part of Intelligent Transportation System (ITS) dynamic spectrum allocation. However, current deep learning-based AMR (DL-AMR) methods struggle to extract discriminative and robust features at low signal-to-noise ratios (SNRs), where the representation of modulation symbols is heavily corrupted by noise. Furthermore, current research on GNN methods for AMR tasks generally suffers from issues related to graph structure construction and computational complexity. In this paper, we propose a Spatial-Temporal-Frequency Graph Convolution Network (STF-GCN) framework, with the temporal domain as the anchor point, to fuse spatial and frequency domain features embedded in the graph structure nodes. On this basis, an adaptive correlation-based adjacency matrix construction method is proposed, which significantly enhances the graph structure's capacity to aggregate local information into individual nodes. In addition, a PoolGAT layer is proposed to coarsen and compress the global key features of the graph, significantly reducing the computational complexity. The experimental results confirm that STF-GCN achieves recognition performance far beyond the state-of-the-art DL-AMR algorithms, with overall accuracies of 64.35%, 66.04% and 70.95% on the RML2016.10a, RML2016.10b and RML22 datasets, respectively. Furthermore, the average recognition accuracies under low SNR conditions from -14 dB to 0 dB outperform the state-of-the-art (SOTA) models by 1.20%, 1.95% and 1.83%, respectively.
Submitted 11 April, 2025;
originally announced April 2025.
-
IncepFormerNet: A multi-scale multi-head attention network for SSVEP classification
Authors:
Yan Huang,
Yongru Chen,
Lei Cao,
Yongnian Cao,
Xuechun Yang,
Yilin Dong,
Tianyu Liu
Abstract:
In recent years, deep learning (DL) models have shown outstanding performance in EEG classification tasks, particularly in Steady-State Visually Evoked Potential (SSVEP)-based Brain-Computer Interface (BCI) systems. This study proposes a new model called IncepFormerNet, which is a hybrid of the Inception and Transformer architectures. IncepFormerNet extracts multi-scale temporal information from time series data using parallel convolution kernels of varying sizes, capturing the subtle variations and critical features within SSVEP signals. Furthermore, the model integrates the multi-head attention mechanism from the Transformer architecture, which not only provides insights into global dependencies but also significantly enhances the understanding and representation of complex patterns. Additionally, it takes advantage of filter bank techniques to extract features based on the spectral characteristics of SSVEP data. To validate the effectiveness of the proposed model, we conducted experiments on two public datasets. The experimental results show that IncepFormerNet achieves an accuracy of 87.41 on Dataset 1 and 71.97 on Dataset 2 using a 1.0-second time window. To further verify the superiority of the proposed model, we compared it with other deep learning models, and the results indicate that our method achieves significantly higher accuracy than the others. The source code for this work is available at: https://github.com/CECNL/SSVEP-DAN.
Submitted 4 February, 2025;
originally announced February 2025.
-
Design and Prototyping of Filtering Active STAR-RIS with Adjustable Power Splitting
Authors:
Rongguang Song,
Haifan Yin,
Xilong Pei,
Lin Cao,
Taorui Yang,
Xue Ren,
Yuanwei Liu
Abstract:
Reconfigurable Intelligent Surfaces (RISs) have emerged as a transformative technology for next-generation wireless communication systems, offering unprecedented control over electromagnetic wave propagation. In particular, Simultaneously Transmitting and Reflecting RISs (STAR-RISs) have garnered significant attention due to their full-space coverage. This paper presents an active STAR-RIS, which enables independent control of both transmission and reflection phases and features out-of-band harmonic suppression. Unlike the traditional passive RIS, the proposed design integrates active amplification to overcome the inherent passive losses, significantly enhancing signal strength and system performance. Additionally, the system supports dynamic power allocation between transmission and reflection modes, providing greater flexibility to meet diverse communication demands in complex propagation environments. The versatility of the design is further validated by extending the Radar Cross Section (RCS)-based path loss model to the STAR-RIS. This design improves efficiency, flexibility, and adaptability, offering a promising solution for future wireless communication systems, particularly in scenarios requiring simultaneous control of transmission and reflection signals.
Submitted 19 January, 2025;
originally announced January 2025.
-
Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings
Authors:
Di Wu,
Siyuan Li,
Chen Feng,
Lu Cao,
Yue Zhang,
Jie Yang,
Mohamad Sawan
Abstract:
Recent advancements in brain-computer interfaces (BCIs) have enabled the decoding of lexical tones from intracranial recordings, offering the potential to restore the communication abilities of speech-impaired tonal language speakers. However, data heterogeneity induced by both physiological and instrumental factors poses a significant challenge for unified invasive brain tone decoding. Traditional subject-specific models, which operate under a heterogeneous decoding paradigm, fail to capture generalized neural representations and cannot effectively leverage data across subjects. To address these limitations, we introduce Homogeneity-Heterogeneity Disentangled Learning for neural Representations (H2DiLR), a novel framework that disentangles and learns both the homogeneity and heterogeneity from intracranial recordings across multiple subjects. To evaluate H2DiLR, we collected stereoelectroencephalography (sEEG) data from multiple participants reading Mandarin materials comprising 407 syllables, representing nearly all Mandarin characters. Extensive experiments demonstrate that H2DiLR, as a unified decoding paradigm, significantly outperforms the conventional heterogeneous decoding approach. Furthermore, we empirically confirm that H2DiLR effectively captures both homogeneity and heterogeneity during neural representation learning.
Submitted 18 February, 2025; v1 submitted 13 October, 2024;
originally announced October 2024.
-
Adaptable, shape-conforming robotic endoscope
Authors:
Jiayang Du,
Lin Cao,
Sanja Dogramazi
Abstract:
This paper introduces a size-adaptable robotic endoscope design, which aims to improve the efficiency and comfort of colonoscopy. The proposed robotic endoscope combines an expansion mechanism with an external drive system and can adjust its shape to different pipe diameters, thus improving stability and propulsion force during locomotion. As the actuator in the expansion mechanism, flexible bellows can provide a normal force of 3.89 N and an axial deformation of nearly 10 mm at the maximum pressure, with a 53% expansion rate in the size of the expandable tip. In tests of the prototype's locomotion performance, we characterized how its propulsion depends on the friction coefficient of the pipe and the motor angular velocity. In the experiment with artificial bowel tissues, the prototype generated a propelling force of 2.83 N and a maximum linear speed of 29.29 m/s on average, and produced effective propulsion when passing through different pipe sizes. The results show that the prototype can adapt its shape to obtain more propulsion. The relationship between propelling force and traction force, structural optimization, and miniaturization still need further exploration.
Submitted 14 September, 2024;
originally announced September 2024.
-
Assessing UHD Image Quality from Aesthetics, Distortions, and Saliency
Authors:
Wei Sun,
Weixia Zhang,
Yuqin Cao,
Linhan Cao,
Jun Jia,
Zijian Chen,
Zicheng Zhang,
Xiongkuo Min,
Guangtao Zhai
Abstract:
UHD images, typically with resolutions equal to or higher than 4K, pose a significant challenge for efficient image quality assessment (IQA) algorithms, as adopting full-resolution images as inputs leads to overwhelming computational complexity and commonly used pre-processing methods like resizing or cropping may cause substantial loss of detail. To address this problem, we design a multi-branch deep neural network (DNN) to assess the quality of UHD images from three perspectives: global aesthetic characteristics, local technical distortions, and salient content perception. Specifically, aesthetic features are extracted from low-resolution images downsampled from the UHD ones, which lose high-frequency texture information but still preserve the global aesthetics characteristics. Technical distortions are measured using a fragment image composed of mini-patches cropped from UHD images based on the grid mini-patch sampling strategy. The salient content of UHD images is detected and cropped to extract quality-aware features from the salient regions. We adopt the Swin Transformer Tiny as the backbone networks to extract features from these three perspectives. The extracted features are concatenated and regressed into quality scores by a two-layer multi-layer perceptron (MLP) network. We employ the mean square error (MSE) loss to optimize prediction accuracy and the fidelity loss to optimize prediction monotonicity. Experimental results show that the proposed model achieves the best performance on the UHD-IQA dataset while maintaining the lowest computational complexity, demonstrating its effectiveness and efficiency. Moreover, the proposed model won first prize in ECCV AIM 2024 UHD-IQA Challenge. The code is available at https://github.com/sunwei925/UIQA.
Submitted 1 September, 2024;
originally announced September 2024.
-
SG-JND: Semantic-Guided Just Noticeable Distortion Predictor For Image Compression
Authors:
Linhan Cao,
Wei Sun,
Xiongkuo Min,
Jun Jia,
Zicheng Zhang,
Zijian Chen,
Yucheng Zhu,
Lizhou Liu,
Qiubo Chen,
Jing Chen,
Guangtao Zhai
Abstract:
Just noticeable distortion (JND), representing the threshold of distortion in an image that is minimally perceptible to the human visual system (HVS), is crucial for image compression algorithms to achieve a trade-off between transmission bit rate and image quality. However, traditional JND prediction methods rely only on pixel-level or sub-band-level features, lacking the ability to capture the impact of image content on JND. To bridge this gap, we propose a Semantic-Guided JND (SG-JND) network to leverage semantic information for JND prediction. In particular, SG-JND consists of three essential modules: the image preprocessing module extracts semantic-level patches from images, the feature extraction module extracts multi-layer features by utilizing cross-scale attention layers, and the JND prediction module regresses the extracted features into the final JND value. Experimental results show that SG-JND achieves state-of-the-art performance on two publicly available JND datasets, which demonstrates the effectiveness of SG-JND and highlights the significance of incorporating semantic information in JND assessment.
Submitted 8 August, 2024;
originally announced August 2024.
-
Multi-scale Restoration of Missing Data in Optical Time-series Images with Masked Spatial-Temporal Attention Network
Authors:
Zaiyan Zhang,
Jining Yan,
Yuanqi Liang,
Jiaxin Feng,
Haixu He,
Li Cao
Abstract:
Remote sensing images often suffer from substantial data loss due to factors such as thick cloud cover and sensor limitations. Existing methods for imputing missing values in remote sensing images fail to fully exploit spatiotemporal auxiliary information, which restricts the accuracy of their reconstructions. To address this issue, this paper proposes a novel deep learning-based approach called MS2TAN (Multi-Scale Masked Spatial-Temporal Attention Network) for reconstructing time-series remote sensing images. First, we introduce an efficient spatiotemporal feature extractor based on Masked Spatial-Temporal Attention (MSTA) to capture high-quality representations of spatiotemporal neighborhood features surrounding missing regions while significantly reducing the computational complexity of the attention mechanism. Second, a Multi-Scale Restoration Network composed of MSTA-based Feature Extractors is designed to progressively refine missing values by exploring spatiotemporal neighborhood features at different scales. Third, we propose a "Pixel-Structure-Perception" Multi-Objective Joint Optimization method to enhance the visual quality of the reconstructed results from multiple perspectives and to preserve more texture structures. Finally, quantitative experimental results under multi-temporal inputs on two public datasets demonstrate that the proposed method outperforms competitive approaches, achieving a 9.76%/9.30% reduction in Mean Absolute Error (MAE) and a 0.56 dB/0.62 dB increase in Peak Signal-to-Noise Ratio (PSNR), along with stronger texture and structural consistency. Ablation experiments further validate the contribution of the core innovations to imputation accuracy.
Submitted 18 November, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
Revisiting Multi-User Downlink in IEEE 802.11ax: A Designers Guide to MU-MIMO
Authors:
Liu Cao,
Lyutianyang Zhang,
Sumit Roy,
Sian Jin
Abstract:
Downlink (DL) Multi-User (MU) Multiple Input Multiple Output (MU-MIMO) is a key technology that allows multiple concurrent data transmissions from an Access Point (AP) to a selected subset of clients for higher network efficiency in IEEE 802.11ax. However, the DL MU-MIMO feature is typically turned off by default in AP vendors' products; that is, turning on DL MU-MIMO may not help increase network efficiency, which is counter-intuitive. In this article, we provide a sufficiently deep understanding of the interplay between the various underlying factors, i.e., CSI overhead and spatial correlation, which lead to negative results when DL MU-MIMO is turned on. Furthermore, we provide a fundamental guideline as a function of operational scenarios to address the fundamental question of when DL MU-MIMO should be turned on or off.
Submitted 19 August, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Augmentation-based Unsupervised Cross-Domain Functional MRI Adaptation for Major Depressive Disorder Identification
Authors:
Yunling Ma,
Chaojun Zhang,
Xiaochuan Wang,
Qianqian Wang,
Liang Cao,
Limei Zhang,
Mingxia Liu
Abstract:
Major depressive disorder (MDD) is a common mental disorder that typically affects a person's mood, cognition, behavior, and physical health. Resting-state functional magnetic resonance imaging (rs-fMRI) data are widely used for computer-aided diagnosis of MDD. While multi-site fMRI data can provide more data for training reliable diagnostic models, significant cross-site data heterogeneity results in poor model generalizability. Many domain adaptation methods are designed to reduce the distributional differences between sites to some extent, but usually ignore the problem of the model overfitting on the source domain. Intuitively, target data augmentation can alleviate the overfitting problem by forcing the model to learn more generalized features and reducing its dependence on source domain data. In this work, we propose a new augmentation-based unsupervised cross-domain fMRI adaptation (AUFA) framework for automatic diagnosis of MDD. The AUFA consists of 1) a graph representation learning module for extracting rs-fMRI features with spatial attention, 2) a domain adaptation module for feature alignment between source and target data, 3) an augmentation-based self-optimization module for alleviating model overfitting on the source domain, and 4) a classification module. Experimental results on 1,089 subjects suggest that AUFA outperforms several state-of-the-art methods in MDD identification. Our approach not only reduces data heterogeneity between different sites, but also localizes disease-related functional connectivity abnormalities and provides interpretability for the model.
Submitted 6 June, 2024; v1 submitted 31 May, 2024;
originally announced June 2024.
-
Enhancing Blind Video Quality Assessment with Rich Quality-aware Features
Authors:
Wei Sun,
Haoning Wu,
Zicheng Zhang,
Jun Jia,
Zhichao Zhang,
Linhan Cao,
Qiubo Chen,
Xiongkuo Min,
Weisi Lin,
Guangtao Zhai
Abstract:
In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous research that leverages pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model handle the complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware along with scene-specific features, and spatiotemporal quality-aware features, respectively. After concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at https://github.com/sunwei925/RQ-VQA.git.
Submitted 14 May, 2024;
originally announced May 2024.
-
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Yajing Pei,
Yiting Lu,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Wei Sun,
Haoning Wu,
Zicheng Zhang,
Jun Jia,
Zhichao Zhang,
Linhan Cao,
Qiubo Chen,
Xiongkuo Min,
Weisi Lin,
Guangtao Zhai,
Jianhui Sun,
Tianyi Wang,
Lei Li,
Han Kong,
Wenxuan Wang,
Bing Li,
Cheng Luo
, et al. (43 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), in which various solutions were submitted and evaluated on KVQ, a dataset collected from a popular short-form video platform, i.e., the Kuaishou/Kwai Platform. The KVQ database is divided into three parts: 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performance for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.
Submitted 17 April, 2024;
originally announced April 2024.
-
Study on the static detection of ICF target based on muonic X-ray sphere encoded imaging
Authors:
Dikai Li,
Jian Yu,
Qian Chen,
Ziming Li,
Chunhui Zhang,
Xiangyu Wan,
Zhibing He,
Leifeng Cao
Abstract:
Muon Induced X-ray Emission (MIXE) was discovered by Chinese physicist Zhang Wenyu as early as 1947, and it enables non-destructive elemental analysis inside samples. Research has shown that MIXE can retain the high efficiency of direct imaging while benefiting from the low noise of pinhole imaging through encoding holes. The related technology significantly improves the counting rate while maintaining imaging quality. The sphere encoding technology effectively solves the imaging blurring caused by the tilting of the encoding system, and successfully images micrometer-sized X-ray sources. This paper combines MIXE and X-ray sphere coding imaging techniques, including ball coding and zone plates, to study non-destructive deep structure imaging of ICF targets and the acquisition of sub element distribution. The aim is to develop a new method for ICF target detection, which is particularly important for inertial confinement fusion. At the same time, this method can be used to detect and analyze materials that are difficult to penetrate or sensitive, and is expected to solve element resolution and imaging problems that traditional technologies cannot overcome. It will provide new methods for the future development of multiple fields such as particle physics, material science, and X-ray optics.
Submitted 5 November, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Unified Predefined-time Stability Conditions of Nonlinear Systems with Lyapunov Analysis
Authors:
Bing Xiao,
Haichao Zhang,
Shijie Zhao,
Lu Cao
Abstract:
This brief gives a set of unified Lyapunov stability conditions to guarantee the predefined-time/finite-time stability of dynamical systems. The derived Lyapunov theorem for autonomous systems establishes equivalence with existing theorems on predefined-time/finite-time stability. The findings proposed herein develop a nonsingular sliding mode control framework for an Euler-Lagrange system to analyze its stability, and the upper bound of its settling time can be arbitrarily determined a priori through a predefined time constant.
Submitted 1 April, 2024;
originally announced April 2024.
-
Hybrid Internal Model: Learning Agile Legged Locomotion with Simulated Robot Response
Authors:
Junfeng Long,
Zirui Wang,
Quanyi Li,
Jiawei Gao,
Liu Cao,
Jiangmiao Pang
Abstract:
Robust locomotion control depends on accurate state estimation. However, the sensors of most legged robots can only provide partial and noisy observations, making the estimation particularly challenging, especially for external states like terrain friction and elevation maps. Inspired by the classical Internal Model Control principle, we consider these external states as disturbances and introduce the Hybrid Internal Model (HIM) to estimate them according to the response of the robot. The response, which we refer to as the hybrid internal embedding, contains the robot's explicit velocity and implicit stability representation, corresponding to two primary goals for locomotion tasks: explicitly tracking velocity and implicitly maintaining stability. We use contrastive learning to optimize the embedding to be close to the robot's successor state, in which the response is naturally embedded. HIM has several appealing benefits: It only needs the robot's proprioception, i.e., observations from joint encoders and the IMU. It innovatively maintains consistent observations between simulation reference and reality, which avoids information loss in mimicking learning. It exploits batch-level information, which is more robust to noise and keeps better sample efficiency. It only requires 1 hour of training on an RTX 4090 to enable a quadruped robot to traverse any terrain under any disturbances. A wealth of real-world experiments demonstrates its agility, even in high-difficulty tasks and cases that never occurred during the training process, revealing remarkable open-world generalizability.
Submitted 1 January, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Unsupervised convolutional neural network fusion approach for change detection in remote sensing images
Authors:
Weidong Yan,
Pei Yan,
Li Cao
Abstract:
With the rapid development of deep learning, a variety of change detection methods based on deep learning have emerged in recent years. However, these methods usually require a large number of training samples to train the network model, which is very expensive. In this paper, we introduce a completely unsupervised shallow convolutional neural network (USCNN) fusion approach for change detection. Firstly, the bi-temporal images are transformed into different feature spaces by using convolution kernels of different sizes to extract multi-scale information of the images. Secondly, the output features of the bi-temporal images at the same convolution kernels are subtracted to obtain the corresponding difference images, and the difference feature images at the same scale are fused into one feature image by using a 1 × 1 convolution layer. Finally, the output features of different scales are concatenated and a 1 × 1 convolution layer is used to fuse the multi-scale information of the image. The model parameters are obtained by a redesigned sparse function. Our model has three features: the entire training process is conducted in an unsupervised manner, the network architecture is shallow, and the objective function is sparse. Thus, it can be seen as a kind of lightweight network model. Experimental results on four real remote sensing datasets indicate the feasibility and effectiveness of the proposed approach.
Submitted 6 November, 2023;
originally announced November 2023.
-
Quantized-but-uncoded Distributed Detection (QDD) with Unreliable Reporting Channels
Authors:
Lei Cao,
Ramanarayanan Viswanathan
Abstract:
Distributed detection primarily centers around two approaches: Unquantized Distributed Detection (UDD), where each sensor reports its complete observation to the fusion center (FC), and quantized-and-Coded DD (CDD), where each sensor first partitions the observation space and then reports a codeword to the FC. In this paper, we introduce Quantized-but-uncoded DD (QDD), where each sensor, after quantization, transmits a summarized value, instead of a codeword, to the FC. We show that QDD adapts well to the constraint of transmission power when compared to CDD, albeit with increased complexity in parameter selection. Moreover, we establish that, in the presence of independent observations, QDD upholds a necessary condition inherent in CDD. Specifically, the optimal sensor decision rules are the likelihood ratio quantizers (LRQ), irrespective of the channel conditions. In the context of a single-sensor scenario involving binary decision at the sensor, we find that the optimal sensor rule in QDD is in general no longer "channel blind", a feature present in CDD. In addition, we compare these systems numerically under the same transmission power and bandwidth, while assuming additive white Gaussian noise (AWGN) in both sensing and reporting stages. Finally, we present some potential directions for future research.
Submitted 4 November, 2023;
originally announced November 2023.
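The likelihood ratio quantizer (LRQ) named in the abstract above can be illustrated with a minimal sketch. It assumes a scalar Gaussian shift-in-mean test (H0: N(0, sigma^2) against H1: N(mu, sigma^2)), which the abstract does not specify; the thresholds and the transmitted amplitude levels are hypothetical values chosen for illustration, not parameters from the paper.

```python
import bisect


def log_likelihood_ratio(x, mu, sigma=1.0):
    """LLR of H1: N(mu, sigma^2) against H0: N(0, sigma^2) for one observation x."""
    return (mu * x - 0.5 * mu * mu) / (sigma * sigma)


def lrq_index(x, mu, thresholds, sigma=1.0):
    """Index of the quantization cell containing the LLR of x.

    `thresholds` must be sorted in increasing order (assumed, not checked).
    """
    return bisect.bisect_right(thresholds, log_likelihood_ratio(x, mu, sigma))


def qdd_symbol(x, mu, thresholds, levels, sigma=1.0):
    """QDD-style report: send one amplitude per cell instead of a codeword.

    `levels` is a hypothetical list with len(thresholds) + 1 amplitudes.
    """
    return levels[lrq_index(x, mu, thresholds, sigma)]


if __name__ == "__main__":
    thresholds = [-0.5, 0.5]      # illustrative LLR thresholds
    levels = [-1.0, 0.0, 1.0]     # illustrative transmit amplitudes
    for obs in (-1.0, 0.5, 2.0):
        print(obs, qdd_symbol(obs, 1.0, thresholds, levels))
```

Because the quantizer acts on the likelihood ratio rather than on the raw observation, the same thresholds carry over to any observation model once `log_likelihood_ratio` is replaced.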
-
Codebook-based Uplink Transmission Enhancement in 5G Advanced: Sub-band Precoding
Authors:
Liu Cao,
Yahia Shabara,
Parisa Cheraghi
Abstract:
The transformative enhancements of fifth-generation (5G) mobile devices bring about new challenges to achieve better uplink (UL) performance. Particularly, in codebook-based transmission, the wide-band (WB) precoding and the legacy UL codebook may become the main bottlenecks for more efficient data transmission. In this paper, we investigate the codebook-based UL single-layer transmission performance using fully coherent antenna ports in the context of sub-band (SB) precoding. We analyze the SB precoder selection criteria and design a UL codebook used for SB precoding by increasing the number of relative phase shifts of each port. Via link-level simulations, we verify that the UL SB precoding can achieve up to 2 dB performance gain in terms of the block error rate (BLER) compared with the UL WB precoding, which is the current UL precoding scheme. We also show that the UL performance gain is sensitive to the SB size selection as well as the relative phase shift diversity.
Submitted 29 October, 2023; v1 submitted 24 October, 2023;
originally announced October 2023.
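The difference between wide-band and sub-band precoder selection described in the abstract above can be sketched for a toy two-port, single-layer case. This is an illustration under stated assumptions, not the paper's codebook or selection criterion: the codebook is a hypothetical set of unit-norm vectors with uniformly spaced relative phases, and the selection metric is simply the summed beamforming gain.

```python
import cmath
import math


def codebook(num_phases):
    """Hypothetical 2-port codebook: w_k = [1, exp(j*2*pi*k/K)] / sqrt(2)."""
    scale = 1.0 / math.sqrt(2.0)
    return [(scale, scale * cmath.exp(2j * math.pi * k / num_phases))
            for k in range(num_phases)]


def gain(h, w):
    """Beamforming gain |h . w|^2 for one subcarrier channel h = (h1, h2)."""
    return abs(h[0] * w[0] + h[1] * w[1]) ** 2


def best_precoder(channels, cb):
    """Codebook index maximizing the summed gain over the given subcarriers."""
    return max(range(len(cb)), key=lambda k: sum(gain(h, cb[k]) for h in channels))


def wideband_gain(channels, cb):
    """One precoder for the whole band."""
    k = best_precoder(channels, cb)
    return sum(gain(h, cb[k]) for h in channels)


def subband_gain(channels, cb, sb_size):
    """One precoder per block of sb_size consecutive subcarriers."""
    total = 0.0
    for start in range(0, len(channels), sb_size):
        block = channels[start:start + sb_size]
        k = best_precoder(block, cb)
        total += sum(gain(h, cb[k]) for h in block)
    return total


if __name__ == "__main__":
    # Toy frequency-selective channel: relative phase drifts across subcarriers.
    chans = [(1.0, cmath.exp(0.4j * n)) for n in range(16)]
    cb = codebook(4)
    print(wideband_gain(chans, cb), subband_gain(chans, cb, 4))
```

Raising `num_phases` or shrinking `sb_size` can only keep or increase the summed gain in this toy model, which mirrors the sensitivity to SB size and phase-shift granularity that the abstract reports.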
-
Measuring Acoustics with Collaborative Multiple Agents
Authors:
Yinfeng Yu,
Changan Chen,
Lele Cao,
Fangkai Yang,
Fuchun Sun
Abstract:
As humans, we hear sound every second of our lives. The sound we hear is often affected by the acoustics of the environment surrounding us. For example, a spacious hall leads to more reverberation. Room Impulse Responses (RIR) are commonly used to characterize environment acoustics as a function of the scene geometry, materials, and source/receiver locations. Traditionally, RIRs are measured by setting up a loudspeaker and microphone in the environment for all source/receiver locations, which is time-consuming and inefficient. We propose to let two robots measure the environment's acoustics by actively moving and emitting/receiving sweep signals. We also devise a collaborative multi-agent policy where these two robots are trained to explore the environment's acoustics while being rewarded for wide exploration and accurate prediction. We show that the robots learn to collaborate and move to explore environment acoustics while minimizing the prediction error. To the best of our knowledge, we present the very first problem formulation and solution to the task of collaborative environment acoustics measurements with multiple agents.
Submitted 8 October, 2023;
originally announced October 2023.
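As background for the abstract above, the role of a Room Impulse Response can be shown in a few lines: the sound heard at the receiver is the emitted signal convolved with the RIR for that source/receiver pair. The sketch below is a plain linear convolution on Python lists; the sample values are made up for illustration and it is unrelated to the robots' learned policy.

```python
def apply_rir(dry, rir):
    """Linear convolution of a dry signal with a room impulse response."""
    if not dry or not rir:
        return []
    out = [0.0] * (len(dry) + len(rir) - 1)
    for i, sample in enumerate(dry):
        for j, tap in enumerate(rir):
            out[i + j] += sample * tap
    return out


if __name__ == "__main__":
    dry = [1.0, 2.0]          # illustrative emitted samples
    rir = [1.0, 0.5]          # direct path plus one weaker echo
    print(apply_rir(dry, rir))
```

A longer, slowly decaying `rir` corresponds to the "spacious hall" case: each emitted sample is smeared over more output samples, which is heard as reverberation.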
-
Semi-Persistent Scheduling in NR Sidelink Mode 2: MAC Packet Reception Ratio Model and ns-3 Validation
Authors:
Liu Cao,
Sumit Roy,
Collin Brady
Abstract:
5G New Radio (NR) Sidelink (SL) has demonstrated promising capability for infrastructure-less cellular coverage. Understanding the fundamentals of the NR SL channel access mechanism, Semi-Persistent Scheduling (SPS), which is specified by the 3rd Generation Partnership Project (3GPP), is necessary to enhance the NR SL Packet Reception Ratio (PRR). However, most existing works fail to account for the new SPS features introduced in NR SL, and might therefore be out of date for comprehensively describing the NR SL PRR. The existing models ignore the relationships between SPS parameters and, therefore, do not provide sufficient insights into the PRR of SPS. This work proposes a novel SPS PRR model incorporating MAC collisions based on new features in NR SL. We extend our model by loosening several simplifying assumptions made in our initial modeling. The extended models illustrate how the PRR is affected by various SPS parameters. The computed results are validated via simulations using the network simulator (ns-3), which provides important guidelines for future NR SL enhancement work.
Submitted 19 August, 2024; v1 submitted 26 July, 2023;
originally announced September 2023.
-
Instruction-Following Speech Recognition
Authors:
Cheng-I Jeff Lai,
Zhiyun Lu,
Liangliang Cao,
Ruoming Pang
Abstract:
Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.
Submitted 18 September, 2023;
originally announced September 2023.
-
Prototyping and real-world field trials of RIS-aided wireless communications
Authors:
Xilong Pei,
Haifan Yin,
Li Tan,
Lin Cao,
Taorui Yang
Abstract:
Reconfigurable intelligent surface (RIS) is a promising technology that has the potential to change the way we interact with the wireless propagation environment. In this paper, we design and fabricate an RIS system that can be used in the fifth generation (5G) mobile communication networks. We also propose a practical two-step spatial-oversampling codebook algorithm for the beamforming of RIS, which is based on the spatial structure of the wireless channel. This algorithm has much lower complexity compared to the two-dimensional full-space searching-based codebook, yet with only negligible performance loss. Then, a series of experiments are conducted with the fabricated RIS systems, covering the office, corridor, and outdoor environments, in order to verify the effectiveness of RIS in both laboratory and current 5G commercial networks. In the office and corridor scenarios, the 5.8 GHz RIS provided a 10-20 dB power gain at the receiver. In the outdoor test, over 35 dB power gain was observed with RIS compared to the non-deployment case. However, in commercial 5G networks, the 2.6 GHz RIS improved indoor signal strength by only 4-7 dB. The experimental results indicate that RIS achieves higher power gain when transceivers are equipped with directional antennas instead of omni-directional antennas.
Submitted 6 August, 2023;
originally announced August 2023.
-
RIS with insufficient phase shifting capability: Modeling, beamforming, and experimental validations
Authors:
Lin Cao,
Haifan Yin,
Li Tan,
Xilong Pei
Abstract:
Most research works on reconfigurable intelligent surfaces (RIS) rely on idealized models of the reflection coefficients, i.e., uniform reflection amplitude for any phase and sufficient phase shifting capability. In practice, however, such models are oversimplified. This paper introduces a realistic reflection coefficient model for RIS based on measurements. The reflection coefficients are modeled as discrete complex values that have non-uniform amplitudes and suffer from insufficient phase shift capability. We then propose a group-based query algorithm that takes the imperfect coefficients into consideration while calculating the reflection coefficients. We analyze the performance of the proposed algorithm, and derive closed-form expressions to characterize the received power of an RIS-aided wireless communication system. The performance gains of the proposed algorithm are confirmed in simulations. Furthermore, we validate the proposed theoretical results by experiments with our fabricated RIS prototype systems. The simulation and measurement results match well with the theoretical analysis.
Submitted 16 April, 2024; v1 submitted 5 July, 2023;
originally announced July 2023.
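The reflection coefficient model in the abstract above (discrete complex states with non-uniform amplitude and a limited phase range) can be illustrated with a toy per-element selection rule. This is explicitly not the paper's group-based query algorithm: it is a simple greedy heuristic, assumed here for illustration, that picks for each element the state whose contribution is best aligned with a zero-phase reference. The state values and channels are hypothetical.

```python
import cmath


def choose_states(cascaded, states):
    """Greedy heuristic: per element, pick the state maximizing Re(state * channel).

    cascaded: complex cascaded channel coefficient per RIS element
    states:   available complex reflection coefficients (shared by all elements)
    """
    return [max(range(len(states)), key=lambda s: (states[s] * g).real)
            for g in cascaded]


def received_power(cascaded, states, choice):
    """Power of the coherent sum of all element contributions."""
    return abs(sum(states[s] * g for s, g in zip(choice, cascaded))) ** 2


if __name__ == "__main__":
    # Hypothetical 1-bit element: second state has lower amplitude and
    # reaches only 2.5 rad of phase shift instead of pi.
    states = [1.0 + 0.0j, 0.8 * cmath.exp(2.5j)]
    cascaded = [1.0 + 0.0j, -1.0 + 0.0j, 1.0j]
    choice = choose_states(cascaded, states)
    print(choice, received_power(cascaded, states, choice))
```

With ideal states (unit amplitude, phases 0 and pi) the same rule reduces to ordinary 1-bit phase alignment; the non-ideal second state shows how amplitude loss and phase shortfall enter the received-power sum.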
-
Resilient Output Containment Control of Heterogeneous Multiagent Systems Against Composite Attacks: A Digital Twin Approach
Authors:
Yukang Cui,
Lingbo Cao,
Michael V. Basin,
Jun Shen,
Tingwen Huang,
Xin Gong
Abstract:
This paper studies the distributed resilient output containment control of heterogeneous multiagent systems against composite attacks, including denial-of-service (DoS) attacks, false-data injection (FDI) attacks, camouflage attacks, and actuation attacks. Inspired by digital twins, a twin layer (TL) with higher security and privacy is used to decouple the above problem into two tasks: defense protocols against DoS attacks on the TL and defense protocols against actuation attacks on the cyber-physical layer (CPL). First, considering modeling errors of leader dynamics, we introduce distributed observers to reconstruct the leader dynamics for each follower on the TL under DoS attacks. Second, distributed estimators are used to estimate follower states according to the reconstructed leader dynamics on the TL. Third, according to the reconstructed leader dynamics, we design decentralized solvers that calculate the output regulator equations on the CPL. Fourth, decentralized adaptive attack-resilient control schemes that resist unbounded actuation attacks are provided on the CPL. Furthermore, we apply the above control protocols to prove that the followers can achieve uniformly ultimately bounded (UUB) convergence, and the upper bound of the UUB convergence is determined explicitly. Finally, two simulation examples are provided to show the effectiveness of the proposed control protocols.
Submitted 22 March, 2023;
originally announced March 2023.
-
RIS-aided Wireless Communications: Can RIS Beat Metal Plate?
Authors:
Jiangfeng Hu,
Haifan Yin,
Li Tan,
Lin Cao,
Xilong Pei
Abstract:
Reconfigurable Intelligent Surface (RIS) has recently been regarded as a paradigm-shifting technology beyond 5G, for its flexibility in smartly adjusting the response to the impinging electromagnetic (EM) waves. Usually, RIS can be implemented by properly reconfiguring the adjustable parameters of each RIS unit to align the signal phase on the receiver side. It is believed that the phase alignment can also be achieved mechanically by a metal plate with the same physical size. However, we found in the prototype experiments that a well-rotated metal plate can only approximately perform as well as RIS under limited conditions, although its scattering efficiency is relatively higher. When it comes to the case of spherical wave impinging, RIS outperforms the metal plate even beyond the receiving near-field regions. We analyze this phenomenon with wave optics theory and propose explicit scattering models for both the metal plate and RIS in general scenarios. Finally, the models are validated by simulations and field measurements.
Submitted 6 March, 2023;
originally announced March 2023.
-
Learning Informative Representation for Fairness-aware Multivariate Time-series Forecasting: A Group-based Perspective
Authors:
Hui He,
Qi Zhang,
Shoujin Wang,
Kun Yi,
Zhendong Niu,
Longbing Cao
Abstract:
Performance unfairness among variables widely exists in multivariate time series (MTS) forecasting models since such models may attend/bias to certain (advantaged) variables. Addressing this unfairness problem is important for equally attending to all variables and avoiding vulnerable model biases/risks. However, fair MTS forecasting is challenging and has been less studied in the literature. To bridge this significant gap, we formulate the fairness modeling problem as learning informative representations attending to both advantaged and disadvantaged variables. Accordingly, we propose a novel framework, named FairFor, for fairness-aware MTS forecasting. FairFor is based on adversarial learning to generate both group-independent and group-relevant representations for the downstream forecasting. The framework first leverages a spectral relaxation of the K-means objective to infer variable correlations and thus to group variables. Then, it utilizes a filtering & fusion component to filter the group-relevant information and generate group-independent representations via orthogonality regularization. The group-independent and group-relevant representations form highly informative representations, facilitating knowledge sharing from advantaged variables to disadvantaged variables to guarantee fairness. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed FairFor for fair forecasting and significant performance improvement.
Submitted 23 October, 2023; v1 submitted 26 January, 2023;
originally announced January 2023.
-
Active Fault Isolation for Discrete Event Systems
Authors:
Lin Cao,
Shaolong Shu,
Feng Lin
Abstract:
In practice, we can not only disable some events, but also enforce the occurrence of some events prior to the occurrence of other events by external control. In this paper, we combine these two control mechanisms to synthesize a more powerful supervisor. Here our control goal is to design an isolation supervisor which ensures that, in the closed-loop system, faults are isolatable in the sense that after a fault occurs, we can determine which type the fault belongs to by observing the output of the closed-loop system. The isolation supervisor starts to work when the occurrence of faults is detected. We then solve the isolation supervisor synthesis problem as follows. For a given discrete event system, we first construct a bipartite transition system which includes all feasible isolation supervisors. An isolation supervisor is feasible if it enforces only events that are physically possible. We then develop an algorithm to check whether the synthesis problem is solvable or not. The algorithm can also be used to find a valid isolation supervisor if the synthesis problem is solvable. The method of combining two control mechanisms can be used to synthesize more powerful supervisors for other supervisory control problems of discrete event systems as well.
Submitted 7 January, 2023;
originally announced January 2023.
-
TriNet: stabilizing self-supervised learning from complete or slow collapse on ASR
Authors:
Lixin Cao,
Jun Wang,
Ben Yang,
Dan Su,
Dong Yu
Abstract:
Self-supervised learning (SSL) models confront challenges of abrupt informational collapse or slow dimensional collapse. We propose TriNet, which introduces a novel triple-branch architecture for preventing collapse and stabilizing the pre-training. TriNet learns the SSL latent embedding space and incorporates it into a higher-level space for predicting pseudo target vectors generated by a frozen teacher. Our experimental results show that the proposed method notably stabilizes and accelerates pre-training and achieves a relative word error rate reduction (WERR) of 6.06% compared to the state-of-the-art (SOTA) Data2vec for a downstream benchmark ASR task. We will release our code at https://github.com/tencent-ailab/.
Submitted 14 March, 2023; v1 submitted 12 December, 2022;
originally announced January 2023.
-
Latency-aware End-to-end Multi-path Data Transmission for URLLC Services
Authors:
Liu Cao,
Abbas Kiani,
Amanda Xiang,
Kaippallimalil John,
Tony Saboorian
Abstract:
5th Generation Mobile Communication Technology (5G) utilizes the Access Traffic Steering, Switching, and Splitting (ATSSS) rule to enable multi-path data transmission, which is currently being standardized. Recently, the 3rd Generation Partnership Project (3GPP) SA1 and SA2 have been working on the multi-path solution for possible improvement from different perspectives. However, the existing 3GPP multi-path solution has some limitations on ultra-reliable low-latency communication (URLLC) traffic in terms of reliability and latency requirements. In order to capture the potential gains of multi-path architecture in the context of URLLC services, this paper proposes a novel traffic splitting technique that can more efficiently exploit the benefit of multi-path architecture in reducing user equipment (UE) uplink (UL) end-to-end (E2E) latency. In particular, we formulate an optimization framework that minimizes the user's UL E2E latency via the joint optimization of the ratio of traffic assigned to each path and the corresponding transmit power. The performance of the proposed scheme is evaluated via well-designed simulations.
Submitted 21 October, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
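The idea of choosing a traffic ratio to minimize end-to-end latency, mentioned in the abstract above, can be sketched in a heavily simplified form. The sketch assumes two paths with fixed data rates and fixed propagation delays and ignores transmit power entirely, so it is not the paper's joint optimization; all function names and numbers are hypothetical.

```python
def split_ratio(size_bits, rate1, rate2, delay1=0.0, delay2=0.0):
    """Fraction of traffic sent on path 1 that minimizes the slower path's latency.

    Path latencies are delay1 + x*size/rate1 (increasing in x) and
    delay2 + (1-x)*size/rate2 (decreasing in x); the maximum of the two is
    smallest where they cross, clipped to the interval [0, 1].
    """
    t1 = size_bits / rate1
    t2 = size_bits / rate2
    x = (delay2 - delay1 + t2) / (t1 + t2)
    return min(1.0, max(0.0, x))


def e2e_latency(x, size_bits, rate1, rate2, delay1=0.0, delay2=0.0):
    """Latency of the slowest path that actually carries traffic."""
    latencies = []
    if x > 0.0:
        latencies.append(delay1 + x * size_bits / rate1)
    if x < 1.0:
        latencies.append(delay2 + (1.0 - x) * size_bits / rate2)
    return max(latencies)


if __name__ == "__main__":
    x = split_ratio(1000.0, 300.0, 100.0)
    print(x, e2e_latency(x, 1000.0, 300.0, 100.0))
```

In this toy model the faster path simply receives a proportionally larger share; coupling each path's rate to its transmit power, as the abstract describes, would turn the closed-form split into a genuine joint optimization.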
-
Pay Self-Attention to Audio-Visual Navigation
Authors:
Yinfeng Yu,
Lele Cao,
Fuchun Sun,
Xiaohong Liu,
Liejun Wang
Abstract:
Audio-visual embodied navigation, as a hot research topic, aims to train a robot to reach an audio target using egocentric visual (from the sensors mounted on the robot) and audio (emitted from the target) input. The audio-visual information fusion strategy is naturally important to the navigation performance, but the state-of-the-art methods still simply concatenate the visual and audio features, potentially ignoring the direct impact of context. Moreover, the existing approaches require either phase-wise training or additional aid (e.g. topology graph and sound semantics). To date, work that deals with the more challenging setup with moving target(s) is still rare. As a result, we propose an end-to-end framework FSAAVN (feature self-attention audio-visual navigation) to learn to chase after a moving audio target using a context-aware audio-visual fusion strategy implemented as a self-attention module. Our thorough experiments validate the superior performance (both quantitatively and qualitatively) of FSAAVN in comparison with the state of the art, and also provide unique insights about the choice of visual modalities, visual/audio encoder backbones and fusion patterns.
Submitted 5 October, 2022; v1 submitted 3 October, 2022;
originally announced October 2022.
-
Architecture-Algorithmic Trade-offs in Multi-path Channel Estimation for mmWAVE Systems
Authors:
Lyutianyang Zhang,
Sumit Roy,
Liu Cao
Abstract:
5G mmWave massive MIMO systems are likely to be deployed in dense urban scenarios, where increasing network capacity is the primary objective. A key component in mmWave transceiver design is channel estimation, which is challenging due to the very large signal bandwidths (on the order of GHz) implying significant resolved spatial multipath, coupled with a large number of Tx/Rx antennas for large-scale MIMO. This results in significantly increased training overhead that in turn leads to unacceptably high computational complexity and power cost. Our work thus highlights the interplay of transceiver architecture and receiver signal processing algorithm choices that fundamentally address (mobile) handset power consumption, with minimal degradation in performance. We investigate trade-offs enabled by the conjunction of a hybrid beamforming mmWave receiver and channel estimation algorithms that exploit available sparsity in such wideband scenarios. A compressive sensing (CS) framework for sparse channel estimation -- Binary Iterative Hard Thresholding (BIHT) \cite{jacques2013robust} followed by a linear reconstruction method with varying quantization (ADC) levels -- is explored to compare the trade-offs between bit-depth and sampling rate for a given ADC power budget. Performance analysis of the BIHT + linear reconstruction method is conducted via simulation studies for 5G specified multi-path channel models and compared to oracle-assisted bounds for validation.
Submitted 7 September, 2022;
originally announced September 2022.
-
Bilateral Network with Channel Splitting Network and Transformer for Thermal Image Super-Resolution
Authors:
Bo Yan,
Leilei Cao,
Fengliang Qi,
Hongbin Wang
Abstract:
In recent years, the Thermal Image Super-Resolution (TISR) problem has become an attractive research topic. TISR can be used in a wide range of fields, including military, medical, agricultural and animal ecology. Due to the success of the PBVS-2020 and PBVS-2021 workshop challenges, the results of TISR keep improving and attract more researchers to sign up for the PBVS-2022 challenge. In this paper, we introduce the technical details of our submission to the PBVS-2022 challenge, designing a Bilateral Network with Channel Splitting Network and Transformer (BN-CSNT) to tackle the TISR problem. Firstly, we designed a context branch based on a channel splitting network with transformer to obtain sufficient context information. Secondly, we designed a spatial branch with a shallow transformer to extract low-level features which can preserve the spatial information. Finally, for the context branch, in order to fuse the features from the channel splitting network and transformer, we proposed an attention refinement module, and then features from the context branch and spatial branch are fused by the proposed feature fusion module. The proposed method can achieve PSNR=33.64, SSIM=0.9263 for x4 and PSNR=21.08, SSIM=0.7803 for x2 in the PBVS-2022 challenge test dataset.
Submitted 23 June, 2022;
originally announced June 2022.
-
Multi-Access Point Coordination for Next-Gen Wi-Fi Networks Aided by Deep Reinforcement Learning
Authors:
Lyutianyang Zhang,
Hao Yin,
Sumit Roy,
Liu Cao
Abstract:
Wi-Fi in the enterprise - characterized by overlapping Wi-Fi cells - constitutes the design challenge for next-generation networks. Standardization for recently started IEEE 802.11be (Wi-Fi 7) Working Groups has focused on significant medium access control layer changes that emphasize the role of the access point (AP) in radio resource management (RRM) for coordinating channel access due to the high collision probability with the distributed coordination function (DCF), especially in dense overlapping Wi-Fi networks. This paper proposes a novel multi-AP coordination system architecture aided by a centralized AP controller (APC). Meanwhile, a deep reinforcement learning channel access (DLCA) protocol is developed to replace the binary exponential backoff mechanism in DCF to enhance the network throughput by enabling the coordination of APs. First-Order Model-Agnostic Meta-Learning further enhances the network throughput. Subsequently, we also put forward a new greedy algorithm to maintain proportional fairness (PF) among multiple APs. Via the simulation, the performance of DLCA protocol in dense overlapping Wi-Fi networks is verified to have strong stability and outperform baselines such as Shared Transmission Opportunity (SH-TXOP) and Request-to-Send/Clear-to-Send (RTS/CTS) in terms of the network throughput by 10% and 3% as well as the network utility considering proportional fairness by 28.3% and 13.8%, respectively.
Submitted 22 June, 2022;
originally announced June 2022.
-
Efficient PHY Layer Abstraction under Imperfect Channel Estimation
Authors:
Liu Cao,
Lyutianyang Zhang,
Sian Jin,
Sumit Roy
Abstract:
As most existing work investigates the PHY layer abstraction under an assumption of perfect channel estimation, it may become unreliable if there exists channel estimation error in a real communication system. This letter improves an efficient PHY layer method, EESM-log-SGN PHY layer abstraction, by considering the presence of channel estimation error. We develop two methods for implementing the EESM-log-SGN PHY abstraction under imperfect channel estimation. We show that the effective SINR is not impacted by the channel estimation error under multiple-input and single-output (MISO)/single-input and single-output (SISO) configurations, which is also verified by the full PHY simulation. The developed methods are then validated under different orthogonal frequency division multiplexing (OFDM) scenarios.
Submitted 8 October, 2022; v1 submitted 22 May, 2022;
originally announced May 2022.
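For readers unfamiliar with the "effective SINR" this abstract refers to, the following is a minimal sketch of the standard exponential effective SINR mapping (EESM), which compresses per-subcarrier SINRs into one scalar. It does not reproduce the log-SGN modelling or the authors' treatment of channel estimation error. The function name `eesm_effective_sinr` and the value of the calibration parameter `beta` are assumptions made for illustration.

```python
import math


def eesm_effective_sinr(sinrs_linear, beta):
    """Map per-subcarrier SINRs (linear scale) to one effective SINR.

    Standard EESM form: -beta * ln( mean( exp(-sinr_i / beta) ) ).
    `beta` is a calibration parameter; its value here is hypothetical.
    """
    if not sinrs_linear:
        raise ValueError("need at least one per-subcarrier SINR")
    if beta <= 0:
        raise ValueError("beta must be positive")
    mean_exp = sum(math.exp(-g / beta) for g in sinrs_linear) / len(sinrs_linear)
    return -beta * math.log(mean_exp)


# A flat channel maps to its own SINR; a frequency-selective channel
# maps to a value below the arithmetic mean of its per-subcarrier SINRs.
flat = eesm_effective_sinr([4.0, 4.0, 4.0], beta=2.0)
selective = eesm_effective_sinr([1.0, 10.0], beta=2.0)
```

Because the mapping is dominated by the weakest subcarriers, the effective SINR of a frequency-selective channel sits closer to its low SINR values than to its mean.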
-
A Multi-Head Convolutional Neural Network With Multi-path Attention improves Image Denoising
Authors:
Jiahong Zhang,
Meijun Qu,
Ye Wang,
Lihong Cao
Abstract:
Recently, convolutional neural networks (CNNs) and attention mechanisms have been widely used in image denoising and achieved satisfactory performance. However, the previous works mostly use a single head to receive the noisy image, limiting the richness of extracted features. Therefore, a novel CNN with multiple heads (MH) named MHCNN is proposed in this paper, whose heads will receive the input images rotated by different rotation angles. MH makes MHCNN simultaneously utilize features of rotated images to remove noise. To integrate these features effectively, we present a novel multi-path attention mechanism (MPA). Unlike previous attention mechanisms that handle pixel-level, channel-level, or patch-level features, MPA focuses on features at the image level. Experiments show MHCNN surpasses other state-of-the-art CNN models on additive white Gaussian noise (AWGN) denoising and real-world image denoising. Its peak signal-to-noise ratio (PSNR) results are higher than other networks, such as BRDNet, RIDNet, PAN-Net, and CSANN. The code is accessible at https://github.com/JiaHongZ/MHCNN.
Submitted 3 November, 2022; v1 submitted 27 April, 2022;
originally announced April 2022.
-
Information fusion approach for biomass estimation in a plateau mountainous forest using a synergistic system comprising UAS-based digital camera and LiDAR
Authors:
Rong Huang,
Wei Yao,
Zhong Xu,
Lin Cao,
Xin Shen
Abstract:
Forest land plays a vital role in global climate, ecosystems, farming and human living environments. Therefore, forest biomass estimation methods are necessary to monitor changes in the forest structure and function, which are key data in natural resources research. Although accurate forest biomass measurements are important in forest inventory and assessments, high-density measurements that involve airborne light detection and ranging (LiDAR) at a low flight height in large mountainous areas are highly expensive. The objective of this study was to quantify the aboveground biomass (AGB) of a plateau mountainous forest reserve using a system that synergistically combines an unmanned aircraft system (UAS)-based digital aerial camera and LiDAR to leverage their complementary advantages. In this study, we utilized digital aerial photogrammetry (DAP), which has the unique advantages of speed, high spatial resolution, and low cost, to compensate for the deficiency of forestry inventory using UAS-based LiDAR that requires terrain-following flight for high-resolution data acquisition. Combined with the sparse LiDAR points acquired by using a high-altitude and high-speed UAS for terrain extraction, dense normalized DAP point clouds can be obtained to produce an accurate and high-resolution canopy height model (CHM). Based on the CHM and spectral attributes obtained from multispectral images, we estimated and mapped the AGB of the region of interest with considerable cost efficiency. Our study supports the development of predictive models for large-scale wall-to-wall AGB mapping by leveraging the complementarity between DAP and LiDAR measurements. This work also reveals the potential of utilizing a UAS-based digital camera and LiDAR synergistically in a plateau mountainous forest area.
Submitted 14 April, 2022;
originally announced April 2022.
-
Parallel Fourier Ptychography reconstruction
Authors:
Guocheng Zhou,
Shaohui Zhang,
Yao Hu,
Lei Cao,
Yong Huang,
Qun Hao
Abstract:
Fourier ptychography has attracted wide attention for its large space-bandwidth product and quantitative phase measurement capability. It is a typical computational imaging technique, which refers to optimizing both the imaging hardware and reconstruction algorithms simultaneously. The data redundancy and inverse problem algorithms are the sources of FPM's excellent performance. But at the same time, this large amount of data processing and complex algorithms also greatly reduce the imaging speed. In this article, we propose a parallel Fourier ptychography reconstruction framework consisting of three levels of parallel computing parts and implement it with both central processing unit (CPU) and compute unified device architecture (CUDA) platforms. In the conventional FPM reconstruction framework, the sample image is divided into multiple sub-regions for separate processing because the illumination angles for different subregions are varied for the same LED and different subregions contain different defocus distances due to the non-planar distribution or non-ideal posture of the biological sample. We first build a parallel computing sub-framework in the spatial domain based on the above-mentioned characteristics. Then, by utilizing the sequential characteristics of different spectrum regions to update, a parallel computing sub-framework in the spectrum domain is carried out in our scheme. The feasibility of the proposed parallel FPM reconstruction framework is verified with different experimental results acquired with the system we built.
Submitted 3 March, 2022;
originally announced March 2022.
-
Learned end-to-end high-resolution lensless fiber imaging toward intraoperative real-time cancer diagnosis
Authors:
Jiachen Wu,
Tijue Wang,
Ortrud Uckermann,
Roberta Galli,
Gabriele Schackert,
Liangcai Cao,
Jürgen Czarske,
Robert Kuschmierz
Abstract:
Endomicroscopy is indispensable for minimally invasive diagnostics in clinical practice. For optical keyhole monitoring of surgical interventions, high-resolution fiber endoscopic imaging is considered to be very promising, especially in combination with label-free imaging techniques to realize in vivo diagnosis. However, the inherent honeycomb artifacts of coherent fiber bundles (CFB) reduce the resolution and limit the clinical applications. We propose an end-to-end lensless fiber imaging scheme toward intraoperative real-time cancer diagnosis. The framework includes resolution enhancement and classification networks that use single-shot fiber bundle images to provide both high-resolution images and tumor diagnosis results. The well-trained resolution enhancement network not only recovers high-resolution features beyond the physical limitations of CFB, but also helps improve the tumor recognition rate. Especially for glioblastoma, the resolution enhancement network helps increase the classification accuracy from 90.8% to 95.6%. The novel technique can enable histological real-time imaging through lensless fiber endoscopy and is promising for rapid and minimally invasive intraoperative diagnosis in clinics.
Submitted 28 February, 2022;
originally announced March 2022.
-
Fourier ptychography multi-parameter neural network with composite physical priori optimization
Authors:
Delong Yang,
Shaohui Zhang,
Chuanjian Zheng,
Guocheng Zhou,
Lei Cao,
Yao Hu,
Qun Hao
Abstract:
Fourier ptychography microscopy (FPM) is a recently developed computational imaging approach for microscopic super-resolution imaging. By sequentially turning on each light-emitting diode (LED) located at a different position on the LED array and acquiring the corresponding images that contain different spatial frequency components, high spatial resolution and quantitative phase imaging can be achieved with a large field-of-view. Nevertheless, FPM has high requirements for the system construction and data acquisition processes, such as precise LED positions, accurate focusing and appropriate exposure time, which bring many limitations to its practical applications. In this paper, inspired by artificial neural networks, we propose a Fourier ptychography multi-parameter neural network (FPMN) with composite physical prior optimization. A hybrid parameter determination strategy combining a physical imaging model and data-driven network training is proposed to recover the multiple layers of the network corresponding to different physical parameters, including sample complex function, system pupil function, defocus distance, LED array position deviation and illumination intensity fluctuation, etc. Among these parameters, LED array position deviation is recovered based on the features of brightfield-to-darkfield transition low-resolution images, while the others are recovered in the process of training the neural network. The feasibility and effectiveness of FPMN are verified through simulations and actual experiments. Therefore, FPMN can evidently reduce the requirements for practical applications of FPM.
Submitted 17 February, 2022;
originally announced February 2022.
-
Quantitative phase imaging through an ultra-thin lensless fiber endoscope
Authors:
Jiawei Sun,
Jiachen Wu,
Song Wu,
Liangcai Cao,
Ruchi Goswami,
Salvatore Girardo,
Jochen Guck,
Nektarios Koukourakis,
Juergen W. Czarske
Abstract:
Quantitative phase imaging (QPI) is a label-free technique providing both morphology and quantitative biophysical information in biomedicine. However, applying such a powerful technique to in vivo pathological diagnosis remains challenging. Multi-core fiber bundles (MCFs) enable ultra-thin probes for in vivo imaging, but current MCF imaging techniques are limited to amplitude imaging modalities. We demonstrate a computational lensless microendoscope that uses an ultra-thin bare MCF to perform quantitative phase imaging of biomedical samples with up to 1 μm lateral resolution and nanoscale axial resolution. The incident complex light field at the measurement side is precisely reconstructed from a single-shot far-field speckle pattern at the detection side, enabling digital focusing and 3D volumetric reconstruction without any mechanical movement. The accuracy of the quantitative phase reconstruction is validated by imaging the phase target and hydrogel beads through the MCF. With the proposed imaging modality, 3D imaging of human cancer cells is achieved through the ultra-thin fiber endoscope, promising widespread clinical applications.
Submitted 6 July, 2022; v1 submitted 22 December, 2021;
originally announced December 2021.
-
Lensless multicore-fiber microendoscope for real-time tailored light field generation with phase encoder neural network (CoreNet)
Authors:
Jiawei Sun,
Jiachen Wu,
Nektarios Koukourakis,
Robert Kuschmierz,
Liangcai Cao,
Juergen Czarske
Abstract:
The generation of tailored light with multi-core fiber (MCF) lensless microendoscopes is widely used in biomedicine. However, the computer-generated holograms (CGHs) used for such applications are typically generated by iterative algorithms, which demand high computation effort, limiting advanced applications like in vivo optogenetic stimulation and fiber-optic cell manipulation. The random and discrete distribution of the fiber cores induces strong spatial aliasing in the CGHs; hence, an approach that can rapidly generate tailored CGHs for MCFs is in high demand. We demonstrate a novel phase encoder deep neural network (CoreNet), which can generate accurate tailored CGHs for MCFs at a near video rate. Simulations show that CoreNet can reduce the computation time by two orders of magnitude and increase the fidelity of the generated light field compared to the conventional CGH techniques. For the first time, tailored CGHs generated in real time are loaded on the fly to the phase-only SLM for dynamic light field generation through the MCF microendoscope in experiments. This paves the way for real-time cell rotation and several further applications that require real-time high-fidelity light delivery in biomedicine.
Submitted 24 November, 2021;
originally announced November 2021.
-
Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition
Authors:
Zhiyun Lu,
Yanwei Pan,
Thibault Doutre,
Parisa Haghani,
Liangliang Cao,
Rohit Prabhavalkar,
Chao Zhang,
Trevor Strohman
Abstract:
End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes-long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. In this paper we study the effect of training utterance length on the word error rate (WER) for RNN-transducer (RNN-T) models. We compare two widely used training objectives, log loss (or RNN-T loss) and minimum word error rate (MWER) loss. We conduct experiments on telephony datasets in four languages. Our experiments show that for both losses, the WER on long-form speech reduces substantially as the training utterance length increases. The average relative WER gain is 15.7% for log loss and 8.8% for MWER loss. When training on short utterances, MWER loss leads to a lower WER than the log loss. This difference between the two losses diminishes when the input length increases.
Submitted 1 April, 2022; v1 submitted 7 October, 2021;
originally announced October 2021.
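The "relative WER gain" figures in this abstract are relative, not absolute, reductions. A minimal sketch of that arithmetic follows; the helper name `relative_wer_reduction` and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: dropping from 20.0% to 17.0% WER is a 3-point
# absolute reduction but a 15% relative reduction.
gain = relative_wer_reduction(20.0, 17.0)
```

Read this way, a 15.7% relative gain on a high baseline WER corresponds to a larger absolute improvement than the same relative gain on a low baseline.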
-
Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition
Authors:
Qiujia Li,
Yu Zhang,
David Qiu,
Yanzhang He,
Liangliang Cao,
Philip C. Woodland
Abstract:
As end-to-end automatic speech recognition (ASR) models reach promising performance, various downstream tasks rely on good confidence estimators for these systems. Recent research has shown that model-based confidence estimators have a significant advantage over using the output softmax probabilities. If the input data to the speech recogniser is from mismatched acoustic and linguistic conditions, the ASR performance and the corresponding confidence estimators may exhibit severe degradation. Since confidence models are often trained on the same in-domain data as the ASR, generalising to out-of-domain (OOD) scenarios is challenging. By keeping the ASR model untouched, this paper proposes two approaches to improve the model-based confidence estimators on OOD data: using pseudo transcriptions and an additional OOD language model. With an ASR model trained on LibriSpeech, experiments show that the proposed methods can greatly improve the confidence metrics on TED-LIUM and Switchboard datasets while preserving in-domain performance. Furthermore, the improved confidence estimators are better calibrated on OOD data and can provide a much more reliable criterion for data selection.
Submitted 2 March, 2022; v1 submitted 7 October, 2021;
originally announced October 2021.
-
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
Authors:
Yu Zhang,
Daniel S. Park,
Wei Han,
James Qin,
Anmol Gulati,
Joel Shor,
Aren Jansen,
Yuanzhong Xu,
Yanping Huang,
Shibo Wang,
Zongwei Zhou,
Bo Li,
Min Ma,
William Chan,
Jiahui Yu,
Yongqiang Wang,
Liangliang Cao,
Khe Chai Sim,
Bhuvana Ramabhadran,
Tara N. Sainath,
Françoise Beaufays,
Zhifeng Chen,
Quoc V. Le,
Chung-Cheng Chiu,
Ruoming Pang
, et al. (1 additional authors not shown)
Abstract:
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.
Submitted 21 July, 2022; v1 submitted 27 September, 2021;
originally announced September 2021.
-
A Complex Constrained Total Variation Image Denoising Algorithm with Application to Phase Retrieval
Authors:
Yunhui Gao,
Liangcai Cao
Abstract:
This paper considers the constrained total variation (TV) denoising problem for complex-valued images. We extend the definition of TV seminorms for real-valued images to dealing with complex-valued ones. In particular, we introduce two types of complex TV in both isotropic and anisotropic forms. To solve the constrained denoising problem, we adopt a dual approach and derive an accelerated gradient projection algorithm. We further generalize the proposed denoising algorithm as a key building block of the proximal gradient scheme to solve a vast class of complex constrained optimization problems with TV regularizers. As an example, we apply the proposed algorithmic framework to phase retrieval. We combine the complex TV regularizer with the conventional projection-based method within the constraint complex TV model. Initial results from both simulated and optical experiments demonstrate the validity of the constrained TV model in extracting sparsity priors within complex-valued images, while also utilizing physically tractable constraints that help speed up convergence.
Submitted 12 September, 2021;
originally announced September 2021.
-
Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models
Authors:
Thibault Doutre,
Wei Han,
Chung-Cheng Chiu,
Ruoming Pang,
Olivier Siohan,
Liangliang Cao
Abstract:
Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming models, a recent study [1] proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teachers' predictions. However, the performance gap between teacher and student WERs remains high. In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). In particular, we show that, despite being weaker than RNN-T models, CTC models are remarkable teachers. Further, by fusing RNN-T and CTC models together, we build the strongest teachers. The resulting student models drastically improve upon streaming models of previous work [1]: the WER decreases by 41% on Spanish, 27% on Portuguese, and 13% on French.
Submitted 25 April, 2021;
originally announced April 2021.
-
Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction
Authors:
David Qiu,
Yanzhang He,
Qiujia Li,
Yu Zhang,
Liangliang Cao,
Ian McGraw
Abstract:
Confidence scores are very useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural networks to learn word or utterance confidence scores for end-to-end ASR. In those studies, word confidence by itself does not model deletions, and utterance confidence does not take advantage of word-level training signals. This paper proposes to jointly learn word confidence, word deletion, and utterance confidence. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without the need for increasing the model size of the confidence estimation module. Using the utterance-level confidence for rescoring also decreases the word error rates on Google's Voice Search and Long-tail Maps datasets by 3-5% relative, without needing a dedicated neural rescorer.
Submitted 26 April, 2021;
originally announced April 2021.