-
Beyond the LUMIR challenge: The pathway to foundational registration models
Authors:
Junyu Chen,
Shuwen Wei,
Joel Honkamaa,
Pekka Marttinen,
Hang Zhang,
Min Liu,
Yichao Zhou,
Zuopeng Tan,
Zhuoyuan Wang,
Yi Wang,
Hongchao Zhou,
Shunbo Hu,
Yi Zhang,
Qian Tao,
Lukas Förner,
Thomas Wendler,
Bailiang Jian,
Benedikt Wiestler,
Tim Hable,
Jin Kim,
Dan Ruan,
Frederic Madesta,
Thilo Sentker,
Wiebke Heyer,
Lianrui Zuo,
et al. (11 additional authors not shown)
Abstract:
Medical image challenges have played a transformative role in advancing the field, catalyzing algorithmic innovation and establishing new performance standards across diverse clinical applications. Image registration, a foundational task in neuroimaging pipelines, has similarly benefited from the Learn2Reg initiative. Building on this foundation, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark designed to assess and advance unsupervised brain MRI registration. Distinct from prior challenges that leveraged anatomical label maps for supervision, LUMIR removes this dependency by providing over 4,000 preprocessed T1-weighted brain MRIs for training without any label maps, encouraging biologically plausible deformation modeling through self-supervision. In addition to evaluating performance on 590 held-out test subjects, LUMIR introduces a rigorous suite of zero-shot generalization tasks, spanning out-of-domain imaging modalities (e.g., FLAIR, T2-weighted, T2*-weighted), disease populations (e.g., Alzheimer's disease), acquisition protocols (e.g., 9.4T MRI), and species (e.g., macaque brains). A total of 1,158 subjects and over 4,000 image pairs were included for evaluation. Performance was assessed using both segmentation-based metrics (Dice coefficient, 95th percentile Hausdorff distance) and landmark-based registration accuracy (target registration error). Across both in-domain and zero-shot tasks, deep learning-based methods consistently achieved state-of-the-art accuracy while producing anatomically plausible deformation fields. The top-performing deep learning-based models demonstrated diffeomorphic properties and inverse consistency, outperforming several leading optimization-based methods, and showing strong robustness to most domain shifts, the exception being a drop in performance on out-of-domain contrasts.
Submitted 29 May, 2025;
originally announced May 2025.
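The LUMIR abstract above names the Dice coefficient as one of its segmentation-based metrics. As a minimal sketch of that standard overlap measure (not the challenge's evaluation code), the following computes Dice for one label between two flattened label maps. The function name, the toy label maps, and the convention of returning 1.0 when a label is absent from both maps are assumptions made for illustration.

```python
def dice_coefficient(seg_a, seg_b, label):
    """Dice overlap for one label between two equally sized, flattened label maps."""
    if len(seg_a) != len(seg_b):
        raise ValueError("label maps must have the same number of voxels")
    in_a = sum(1 for v in seg_a if v == label)
    in_b = sum(1 for v in seg_b if v == label)
    both = sum(1 for a, b in zip(seg_a, seg_b) if a == label and b == label)
    if in_a + in_b == 0:
        # Assumption: a label missing from both maps counts as perfect agreement.
        return 1.0
    return 2.0 * both / (in_a + in_b)


# Hypothetical toy data: a warped moving segmentation and a fixed segmentation.
warped = [0, 1, 1, 2, 2, 2, 0, 0]
fixed = [0, 1, 2, 2, 2, 0, 0, 0]

for label in (1, 2):
    print(label, round(dice_coefficient(warped, fixed, label), 3))
```

In a registration setting the first argument would be the moving image's label map after applying the estimated deformation, so a higher Dice indicates better anatomical alignment.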
-
SpeakStream: Streaming Text-to-Speech with Interleaved Data
Authors:
Richard He Bai,
Zijin Gu,
Tatiana Likhomanenko,
Navdeep Jaitly
Abstract:
The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and run at inference time on complete utterances, introduce unacceptable delays, even with optimized inference speeds, when coupled with streaming LLM outputs. This is particularly problematic for creating responsive conversational agents where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained using a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents where an LLM streams text to a TTS system. Our experiments demonstrate that SpeakStream achieves state-of-the-art first-token latency while maintaining the quality of non-streaming TTS systems.
Submitted 25 May, 2025;
originally announced May 2025.
-
EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events
Authors:
Shuoyan Wei,
Feng Li,
Shengeng Tang,
Yao Zhao,
Huihui Bai
Abstract:
Continuous space-time video super-resolution (C-STVSR) endeavors to upscale videos simultaneously at arbitrary spatial and temporal scales, and has recently garnered increasing interest. However, prevailing methods struggle to yield satisfactory videos at out-of-distribution spatial and temporal scales. On the other hand, event streams, characterized by high temporal resolution and high dynamic range, exhibit compelling promise in vision tasks. This paper presents EvEnhancer, an approach that exploits the unique advantages of event streams to improve effectiveness, efficiency, and generalizability for C-STVSR. Our approach hinges on two pivotal components: 1) Event-adapted synthesis capitalizes on the spatiotemporal correlations between frames and events to discern and learn long-term motion trajectories, enabling the adaptive interpolation and fusion of informative spatiotemporal features; 2) Local implicit video transformer integrates a local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations, which are used to generate plausible videos at arbitrary resolutions and frame rates. Experiments show that EvEnhancer outperforms state-of-the-art methods on synthetic and real-world datasets and generalizes better to out-of-distribution scales. Code is available at https://github.com/W-Shuoyan/EvEnhancer.
Submitted 6 May, 2025;
originally announced May 2025.
-
Exploiting inter-agent coupling information for efficient reinforcement learning of cooperative LQR
Authors:
Shahbaz P Qadri Syed,
He Bai
Abstract:
Developing scalable and efficient reinforcement learning algorithms for cooperative multi-agent control has received significant attention over the past years. Existing literature has proposed inexact decompositions of local Q-functions based on empirical information structures between the agents. In this paper, we exploit inter-agent coupling information and propose a systematic approach to exactly decompose the local Q-function of each agent. We develop an approximate least-squares policy iteration algorithm based on the proposed decomposition and identify two architectures to learn the local Q-function for each agent. We establish that the worst-case sample complexity of the decomposition is equal to that of the centralized case and derive necessary and sufficient graphical conditions on the inter-agent couplings to achieve better sample efficiency. We demonstrate the improved sample efficiency and computational efficiency on numerical examples.
Submitted 29 April, 2025;
originally announced April 2025.
-
TransVFC: A Transformable Video Feature Compression Framework for Machines
Authors:
Yuxiao Sun,
Yao Zhao,
Meiqin Liu,
Chao Yao,
Huihui Bai,
Chunyu Lin,
Weisi Lin
Abstract:
Nowadays, more and more video transmissions primarily aim at downstream machine vision tasks rather than humans. While widely deployed Human Visual System (HVS) oriented video coding standards like H.265/HEVC and H.264/AVC are efficient, they are not the optimal approaches for Video Coding for Machines (VCM) scenarios, leading to unnecessary bitrate expenditure. Academic and technical exploration within the VCM domain has produced several strategies, yet conspicuous limitations remain in their adaptability to multi-task scenarios. To address this challenge, we propose a Transformable Video Feature Compression (TransVFC) framework. It offers a compress-then-transfer solution and includes a video feature codec and Feature Space Transform (FST) modules. In particular, the temporal redundancy of video features is squeezed by the codec through the scheme-based inter-prediction module. Then, the codec implements perception-guided conditional coding to minimize spatial redundancy and help the reconstructed features align with downstream machine perception. After that, the reconstructed features are transferred to new feature spaces for diverse downstream tasks by FST modules. Accommodating a new downstream task only requires training one lightweight FST module, avoiding retraining and redeploying the upstream codec and downstream task networks. Experiments show that TransVFC achieves high rate-task performance for diverse tasks of different granularities. We expect our work can provide valuable insights for video feature compression in multi-task scenarios. Code is available at https://github.com/Ws-Syx/TransVFC.
Submitted 31 March, 2025;
originally announced March 2025.
-
Simultaneous Automatic Picking and Manual Picking Refinement for First-Break
Authors:
Haowen Bai,
Zixiang Zhao,
Jiangshe Zhang,
Yukun Cui,
Chunxia Zhang,
Zhenbo Guo,
Yongjun Wang
Abstract:
First-break picking is a pivotal procedure in processing microseismic data for geophysics and resource exploration. Recent advancements in deep learning have catalyzed the evolution of automated methods for identifying first-break. Nevertheless, the complexity of seismic data acquisition and the requirement for detailed, expert-driven labeling often result in outliers and potential mislabeling within manually labeled datasets. These issues can negatively affect the training of neural networks, necessitating algorithms that handle outliers or mislabeled data effectively. We introduce the Simultaneous Picking and Refinement (SPR) algorithm, designed to handle datasets plagued by outlier samples or even noisy labels. Unlike conventional approaches that regard manual picks as ground truth, our method treats the true first-break as a latent variable within a probabilistic model that includes a first-break labeling prior. SPR aims to uncover this variable, enabling dynamic adjustments and improved accuracy across the dataset. This strategy mitigates the impact of outliers or inaccuracies in manual labels. Intra-site picking experiments and cross-site generalization experiments on publicly available data confirm our method's performance in identifying first-break and its generalization across different sites. Additionally, our investigations into noisy signals and labels underscore SPR's resilience to both types of noise and its capability to refine misaligned manual annotations. Moreover, the flexibility of SPR, not being limited to any single network architecture, enhances its adaptability across various deep learning-based picking methods. Focusing on learning from data that may contain outliers or partial inaccuracies, SPR provides a robust solution to some of the principal obstacles in automatic first-break picking.
Submitted 3 February, 2025;
originally announced February 2025.
-
Optimizing Prompt Strategies for SAM: Advancing lesion Segmentation Across Diverse Medical Imaging Modalities
Authors:
Yuli Wang,
Victoria Shi,
Wen-Chi Hsu,
Yuwei Dai,
Sophie Yao,
Zhusi Zhong,
Zishu Zhang,
Jing Wu,
Aaron Maxwell,
Scott Collins,
Zhicheng Jiao,
Harrison X. Bai
Abstract:
Purpose: To evaluate various Segment Anything Model (SAM) prompt strategies across four lesion datasets and to subsequently develop a reinforcement learning (RL) agent to optimize SAM prompt placement. Materials and Methods: This retrospective study included patients from four independent ovarian, lung, renal, and breast tumor datasets. Manual segmentation and SAM-assisted segmentation were performed for all lesions. An RL model was developed to predict and select SAM points to maximize segmentation performance. Statistical analysis of segmentation was conducted using pairwise t-tests. Results: Increasing the number of prompt points significantly improved segmentation accuracy, with Dice coefficients rising from 0.272 for a single point to 0.806 for five or more points in ovarian tumors. The prompt location also influenced performance, with surface and union-based prompts outperforming center-based prompts, achieving mean Dice coefficients of 0.604 and 0.724 for ovarian and breast tumors, respectively. The RL agent achieved a peak Dice coefficient of 0.595 for ovarian tumors, outperforming random and alternative RL strategies. Additionally, it significantly reduced segmentation time, achieving a nearly 10-fold improvement compared to manual methods using SAM. Conclusion: While increased SAM prompts and non-centered prompts generally improved segmentation accuracy, each pathology and modality has specific optimal thresholds and placement strategies. Our RL agent achieved superior performance compared to other agents while achieving a significant reduction in segmentation time.
Submitted 28 December, 2024; v1 submitted 23 December, 2024;
originally announced December 2024.
-
Data-Driven Quantification of Battery Degradation Modes via Critical Features from Charging
Authors:
Yuanhao Cheng,
Hanyu Bai,
Yichen Liang,
Xiaofan Cui,
Weiren Jiang,
Ziyou Song
Abstract:
Battery degradation modes influence the aging behavior of Li-ion batteries, leading to accelerated capacity loss and potential safety issues. Quantifying these aging mechanisms poses challenges for both online and offline diagnostics in charging station applications. Data-driven algorithms have emerged as effective tools for addressing state-of-health issues by learning hard-to-model electrochemical properties from data. This paper presents a data-driven method for quantifying battery degradation modes. Ninety-one statistical features are extracted from the incremental capacity curve derived from 1/3C charging data. These features are then screened based on dispersion, contribution, and correlation. Subsequently, machine learning models, including four baseline algorithms and a feedforward neural network, are used to estimate the degradation modes. Experimental validation indicates that the feedforward neural network outperforms the others, achieving a root mean square error of around 10% across all three degradation modes (i.e., loss of lithium inventory, loss of active material on the positive electrode, and loss of active material on the negative electrode). The findings in this paper demonstrate the potential of machine learning for diagnosing battery degradation modes in charging station scenarios.
Submitted 13 December, 2024;
originally announced December 2024.
-
Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image
Authors:
Shuang Xu,
Zixiang Zhao,
Haowen Bai,
Chang Yu,
Jiangjun Peng,
Xiangyong Cao,
Deyu Meng
Abstract:
Hyperspectral images (HSIs) are frequently noisy and of low resolution due to the constraints of imaging devices. Recently launched satellites can concurrently acquire HSIs and panchromatic (PAN) images, enabling the restoration of HSIs to generate clean and high-resolution imagery through fusing PAN images for denoising and super-resolution. However, previous studies treated these two tasks as independent processes, resulting in accumulated errors. This paper introduces Hyperspectral Image Joint Pandenoising and Pansharpening (Hipandas), a novel learning paradigm that reconstructs high-resolution hyperspectral (HRHS) images from noisy low-resolution HSIs (LRHS) and high-resolution PAN images. The proposed zero-shot Hipandas framework consists of a guided denoising network, a guided super-resolution network, and a PAN reconstruction network, utilizing an HSI low-rank prior and a newly introduced detail-oriented low-rank prior. The interconnection of these networks complicates the training process, necessitating a two-stage training strategy to ensure effective training. Experimental results on both simulated and real-world datasets indicate that the proposed method surpasses state-of-the-art algorithms, yielding more accurate and visually pleasing HRHS images.
Submitted 5 December, 2024;
originally announced December 2024.
-
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
Authors:
Akshita Gupta,
Tatiana Likhomanenko,
Karren Dai Yang,
Richard He Bai,
Zakaria Aldeneh,
Navdeep Jaitly
Abstract:
The rapid progress of foundation models and large language models (LLMs) has fueled significant improvement in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's ability to jointly process and leverage multimodal inputs. To specifically investigate the alignment of text, video, and speech modalities in LLM-style (decoder-only) models, we consider a simplified multimodal generation task, Video-Text to Speech (VTTS): speech generation conditioned on both its corresponding text and video of talking people. The ultimate goal is to generate speech that not only follows the text but also aligns temporally with the video and is consistent with the facial expressions. In this paper, we first introduce Visatronic, a unified multimodal decoder-only transformer model that adopts an LLM-style architecture to embed visual, textual, and speech inputs into a shared subspace, treating all modalities as temporally aligned token streams. Next, we carefully explore different token mixing strategies to understand the best way to propagate information from the steps where video and text conditioning is input to the steps where the audio is generated. We extensively evaluate Visatronic on the challenging VoxCeleb2 dataset and demonstrate zero-shot generalization to LRS3, where Visatronic, trained on VoxCeleb2, achieves a 4.5% WER, outperforming prior SOTA methods trained only on LRS3, which report a 21.4% WER. Additionally, we propose a new objective metric, TimeSync, specifically designed to measure phoneme-level temporal alignment between generated and reference speech, further ensuring synchronization quality. Demo: https://apple.github.io/visatronic-demo/
Submitted 29 May, 2025; v1 submitted 26 November, 2024;
originally announced November 2024.
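The Visatronic abstract above reports results as word error rate (WER). As a minimal sketch of how that metric is conventionally defined (word-level edit distance divided by the number of reference words), the following may help. It is not the authors' evaluation pipeline: the function name and example sentences are hypothetical, and real evaluations usually normalize text and transcribe the generated speech with an ASR system first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six reference words is dropped.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```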
-
SEAL: SEmantic-Augmented Imitation Learning via Language Model
Authors:
Chengyang Gu,
Yuxin Pan,
Haotian Bai,
Hui Xiong,
Yize Chen
Abstract:
Hierarchical Imitation Learning (HIL) is a promising approach for tackling long-horizon decision-making tasks. However, it remains challenging because it lacks detailed supervisory labels for sub-goal learning and relies on hundreds to thousands of expert demonstrations. In this work, we introduce SEAL, a novel framework that leverages the powerful semantic and world knowledge of Large Language Models (LLMs) both to specify the sub-goal space and to pre-label states with semantically meaningful sub-goal representations, without prior knowledge of task hierarchies. SEAL employs a dual-encoder structure, combining supervised LLM-guided sub-goal learning with unsupervised Vector Quantization (VQ) for more robust sub-goal representations. Additionally, SEAL incorporates a transition-augmented low-level planner for improved adaptation to sub-goal transitions. Our experiments demonstrate that SEAL outperforms state-of-the-art HIL methods and LLM-based planning approaches, particularly in settings with small expert datasets and complex long-horizon tasks.
Submitted 3 October, 2024;
originally announced October 2024.
-
Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
Authors:
Li-Wei Chen,
Takuya Higuchi,
He Bai,
Ahmed Hussen Abdelaziz,
Alexander Rudnicky,
Shinji Watanabe,
Tatiana Likhomanenko,
Barry-John Theobald,
Zakaria Aldeneh
Abstract:
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a range of downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework impacts their performance on downstream tasks. For instance, models pre-trained with targets that capture prosody learn representations suited for speaker-related tasks, while those pre-trained with targets that capture phonetics learn representations suited for content-related tasks. Moreover, prediction targets can differ in the level of detail they capture. Models pre-trained with targets that encode fine-grained acoustic features perform better on tasks like denoising, while those pre-trained with targets focused on higher-level abstractions are more effective for content-related tasks. Despite the importance of prediction targets, the design choices that affect them have not been thoroughly studied. This work explores the design choices and their impact on downstream task performance. Our results indicate that the commonly used design choices for HuBERT can be suboptimal. We propose approaches to create more informative prediction targets and demonstrate their effectiveness through improvements across various downstream tasks.
Submitted 17 January, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Content-driven Magnitude-Derivative Spectrum Complementary Learning for Hyperspectral Image Classification
Authors:
Huiyan Bai,
Tingfa Xu,
Huan Chen,
Peifu Liu,
Jianan Li
Abstract:
Extracting discriminative information from the complex spectral details of hyperspectral images (HSIs) is pivotal for HSI classification. While prevailing methods rely on spectral magnitude features, these features can cause confusion between certain classes, resulting in misclassification and decreased accuracy. We find that the derivative spectrum is more adept at capturing concealed information, thereby offering a distinct advantage in separating these easily confused classes. Leveraging the complementarity between spectral magnitude and derivative features, we propose a Content-driven Spectrum Complementary Network based on a Magnitude-Derivative Dual Encoder, employing these two features as combined inputs. To fully utilize their complementary information, we propose a Content-adaptive Point-wise Fusion Module, enabling adaptive fusion of dual-encoder features in a point-wise selective manner, contingent upon feature representation. To preserve a rich source of complementary information while extracting more distinguishable features, we introduce a Hybrid Disparity-enhancing Loss that enhances the differential expression of the features from the two branches and increases the inter-class distance. As a result, our method achieves state-of-the-art results on the extensive WHU-OHS dataset and eight other benchmark datasets.
Submitted 26 July, 2024;
originally announced July 2024.
-
dMel: Speech Tokenization made Simple
Authors:
Richard He Bai,
Tatiana Likhomanenko,
Ruixiang Zhang,
Zijin Gu,
Zakaria Aldeneh,
Navdeep Jaitly
Abstract:
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressors introduce additional complexity and computational cost, and often fail on out-of-domain audio signals. In this work, we introduce a novel speech representation (dmel) that discretizes mel-filterbank channels into intensity bins, creating a simpler yet more effective representation compared to existing speech tokenization methods. Our approach demonstrates superior performance in preserving audio content and robustness to out-of-domain data, and offers a training-free, natural, and streamable representation. To address the high-dimensional nature of log-mel spectrograms, we propose an efficient parallel encoding and decoding method for high-dimensional tokens using an LM-style transformer architecture. This innovation enables us to develop RichTTS and RichASR, two models sharing the same architecture while achieving comparable or better results than specialized existing methods. Our results demonstrate the effectiveness of dmel in achieving high performance on both speech synthesis and recognition tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
Submitted 21 May, 2025; v1 submitted 22 July, 2024;
originally announced July 2024.
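The dMel abstract above describes discretizing mel-filterbank channels into intensity bins. A minimal sketch of that general idea, assuming uniform bins over a fixed log-mel range, could look like the following. The bin count (16), the value range (-7.0 to 2.0), the uniform spacing, and the function names are all assumptions made for illustration and are not taken from the paper.

```python
def quantize(log_mel_frame, num_bins=16, lo=-7.0, hi=2.0):
    """Map each channel's log-mel value to the index of a uniform intensity bin."""
    width = (hi - lo) / num_bins
    tokens = []
    for value in log_mel_frame:
        clipped = min(max(value, lo), hi)     # keep values inside the assumed range
        index = int((clipped - lo) / width)
        tokens.append(min(index, num_bins - 1))  # the upper edge falls in the last bin
    return tokens


def dequantize(tokens, num_bins=16, lo=-7.0, hi=2.0):
    """Reconstruct each channel as the center of its intensity bin."""
    width = (hi - lo) / num_bins
    return [lo + (t + 0.5) * width for t in tokens]


# Hypothetical 4-channel frame; real log-mel frames typically have many more channels.
frame = [-6.3, -1.2, 0.4, 1.9]
tokens = quantize(frame)
print(tokens)
print(dequantize(tokens))
```

Because the mapping is a fixed lookup with no learned parameters, a scheme of this kind needs no training, and each frame can be tokenized independently as it arrives.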
-
Learned HDR Image Compression for Perceptually Optimal Storage and Display
Authors:
Peibei Cao,
Haoyu Chen,
Jingzhe Ma,
Yu-Chieh Yuan,
Zhiyong Xie,
Xin Xie,
Haiqing Bai,
Kede Ma
Abstract:
High dynamic range (HDR) capture and display have seen significant growth in popularity driven by the advancements in technology and increasing consumer demand for superior image quality. As a result, HDR image compression is crucial to fully realize the benefits of HDR imaging without suffering from large file sizes and inefficient data handling. Conventionally, this is achieved by introducing a residual/gain map as additional metadata to bridge the gap between HDR and low dynamic range (LDR) images, making the former compatible with LDR image codecs but offering suboptimal rate-distortion performance. In this work, we initiate efforts towards end-to-end optimized HDR image compression for perceptually optimal storage and display. Specifically, we learn to compress an HDR image into two bitstreams: one for generating an LDR image to ensure compatibility with legacy LDR displays, and another as side information to aid HDR image reconstruction from the output LDR image. To measure the perceptual quality of output HDR and LDR images, we use two recently proposed image distortion metrics, both validated against human perceptual data of image quality and with reference to the uncompressed HDR image. Through end-to-end optimization for rate-distortion performance, our method dramatically improves HDR and LDR image quality at all bit rates.
Submitted 18 July, 2024;
originally announced July 2024.
-
Unraveling Radiomics Complexity: Strategies for Optimal Simplicity in Predictive Modeling
Authors:
Mahdi Ait Lhaj Loutfi,
Teodora Boblea Podasca,
Alex Zwanenburg,
Taman Upadhaya,
Jorge Barrios,
David R. Raleigh,
William C. Chen,
Dante P. I. Capaldi,
Hong Zheng,
Olivier Gevaert,
Jing Wu,
Alvin C. Silva,
Paul J. Zhang,
Harrison X. Bai,
Jan Seuntjens,
Steffen Löck,
Patrick O. Richard,
Olivier Morin,
Caroline Reinhold,
Martin Lepage,
Martin Vallières
Abstract:
Background: The high dimensionality of radiomic feature sets, the variability in radiomic feature types and potentially high computational requirements all underscore the need for an effective method to identify the smallest set of predictive features for a given clinical problem. Purpose: Develop a methodology and tools to identify and explain the smallest set of predictive radiomic features. Materials and Methods: 89,714 radiomic features were extracted from five cancer datasets: low-grade glioma, meningioma, non-small cell lung cancer (NSCLC), and two renal cell carcinoma cohorts (n=2104). Features were categorized by computational complexity into morphological, intensity, texture, linear filters, and nonlinear filters. Models were trained and evaluated on each complexity level using the area under the curve (AUC). The most informative features were identified, and their importance was explained. The optimal complexity level and associated most informative features were identified using systematic statistical significance analyses and a false discovery avoidance procedure, respectively. Their predictive importance was explained using a novel tree-based method. Results: MEDimage, a new open-source tool, was developed to facilitate radiomic studies. Morphological features were optimal for MRI-based meningioma (AUC: 0.65) and low-grade glioma (AUC: 0.68). Intensity features were optimal for CECT-based renal cell carcinoma (AUC: 0.82) and CT-based NSCLC (AUC: 0.76). Texture features were optimal for MRI-based renal cell carcinoma (AUC: 0.72). Tuning the Hounsfield unit range improved results for CECT-based renal cell carcinoma (AUC: 0.86). Conclusion: Our proposed methodology and software can estimate the optimal radiomics complexity level for specific medical outcomes, potentially simplifying the use of radiomics in predictive modeling across various contexts.
Submitted 5 July, 2024;
originally announced July 2024.
-
Once-for-All: Controllable Generative Image Compression with Dynamic Granularity Adaption
Authors:
Anqi Li,
Feng Li,
Yuxi Liu,
Runmin Cong,
Yao Zhao,
Huihui Bai
Abstract:
Although recent generative image compression methods have demonstrated impressive potential in optimizing the rate-distortion-perception trade-off, they still face the critical challenge of flexible rate adaption to diverse compression necessities and scenarios. To overcome this challenge, this paper proposes a Controllable Generative Image Compression framework, termed Control-GIC, the first capable of fine-grained bitrate adaption across a broad spectrum while ensuring high-fidelity and generality compression. Control-GIC is grounded in a VQGAN framework that encodes an image as a sequence of variable-length codes (i.e. VQ-indices), which can be losslessly compressed and exhibits a direct positive correlation with the bitrates. Drawing inspiration from the classical coding principle, we correlate the information density of local image patches with their granular representations. Hence, we can flexibly determine a proper allocation of granularity for the patches to achieve dynamic adjustment for VQ-indices, resulting in desirable compression rates. We further develop a probabilistic conditional decoder capable of retrieving historic encoded multi-granularity representations according to transmitted codes, and then reconstruct hierarchical granular features in the formalization of conditional probability, enabling more informative aggregation to improve reconstruction realism. Our experiments show that Control-GIC allows highly flexible and controllable bitrate adaption where the results demonstrate its superior performance over recent state-of-the-art methods.
Submitted 4 December, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
Authors:
Zijin Gu,
Tatiana Likhomanenko,
He Bai,
Erik McDermott,
Ronan Collobert,
Navdeep Jaitly
Abstract:
Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several key ingredients: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs and greatly surpasses the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
Submitted 24 May, 2024;
originally announced May 2024.
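The word error rate (WER) figures quoted in the abstract above follow the usual definition: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that metric only, not of the DLM itself; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Tokenization by whitespace is an assumption of this sketch; real
    evaluations usually normalize text (case, punctuation) first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.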
-
Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation
Authors:
Kang Liu,
Zhuoqi Ma,
Xiaolu Kang,
Zhusi Zhong,
Zhicheng Jiao,
Grayson Baird,
Harrison Bai,
Qiguang Miao
Abstract:
The automated generation of imaging reports proves invaluable in alleviating the workload of radiologists. A clinically applicable report generation algorithm should demonstrate its effectiveness in producing reports that accurately describe radiology findings and attend to patient-specific indications. In this paper, we introduce a novel method, Structural Entities extraction and patient indications Incorporation (SEI), for chest X-ray report generation. Specifically, we employ a structural entities extraction (SEE) approach to eliminate presentation-style vocabulary in reports and improve the quality of factual entity sequences. This reduces the noise in the following cross-modal alignment module by aligning X-ray images with factual entity sequences in reports, thereby enhancing the precision of cross-modal alignment and further aiding the model in gradient-free retrieval of similar historical cases. Subsequently, we propose a cross-modal fusion network to integrate information from X-ray images, similar historical cases, and patient-specific indications. This process allows the text decoder to attend to discriminative features of X-ray images, assimilate historical diagnostic information from similar cases, and understand the examination intention of patients. This, in turn, assists in triggering the text decoder to produce high-quality reports. Experiments conducted on MIMIC-CXR validate the superiority of SEI over state-of-the-art approaches on both natural language generation and clinical efficacy metrics.
Submitted 22 May, 2024;
originally announced May 2024.
-
Multi-modality Regional Alignment Network for Covid X-Ray Survival Prediction and Report Generation
Authors:
Zhusi Zhong,
Jie Li,
John Sollee,
Scott Collins,
Harrison Bai,
Paul Zhang,
Terrence Healey,
Michael Atalay,
Xinbo Gao,
Zhicheng Jiao
Abstract:
In response to the worldwide COVID-19 pandemic, advanced automated technologies have emerged as valuable tools to aid healthcare professionals in managing an increased workload by improving radiology report generation and prognostic analysis. This study proposes the Multi-modality Regional Alignment Network (MRANet), an explainable model for radiology report generation and survival prediction that focuses on high-risk regions. By learning spatial correlation in the detector, MRANet visually grounds region-specific descriptions, providing robust anatomical regions with a completion strategy. The visual features of each region are embedded using a novel survival attention mechanism, offering spatially and risk-aware features for sentence encoding while maintaining global coherence across tasks. A cross-LLM alignment is employed to enhance the image-to-text transfer process, resulting in sentences rich with clinical detail and improved explainability for radiologists. Multi-center experiments validate both MRANet's overall performance and each module's composition within the model, encouraging further advancements in radiology report generation research emphasizing clinical interpretation and trustworthiness in AI models applied to medical studies. The code is available at https://github.com/zzs95/MRANet.
Submitted 22 May, 2024;
originally announced May 2024.
-
Restricting Voltage Deviation of DC Microgrids with Critical and Ordinary Nodes
Authors:
Handong Bai,
Peng Li,
Hongwei Zhang
Abstract:
Restricting bus voltage deviation is crucial for normal operation of multi-bus DC microgrids, yet it has received insufficient attention due to the conflict between two main control objectives in DC microgrids, i.e., voltage regulation and current sharing. By revealing a necessary and sufficient condition for achieving these two objectives, this paper proposes a compromised distributed control algorithm, which regulates the voltage deviation of all buses by relaxing the accuracy of current sharing. Moreover, for a class of DC Microgrids consisting of both critical nodes and ordinary nodes, this paper proposes a distributed control algorithm that restricts the voltage deviation of critical nodes and simultaneously keeps the current sharing of ordinary nodes. This algorithm also works under plug-and-play settings. Simulations illustrate our theory.
Submitted 22 May, 2024;
originally announced May 2024.
-
On the Reachability of 3-Dimensional Paths with a Prescribed Curvature Bound
Authors:
Juho Bae,
Ji Hoon Bai,
Byung-Yoon Lee,
Jun-Yong Lee,
Chang-Hun Lee
Abstract:
This paper presents the reachability analysis of curves in $\mathbb{R}^3$ with a prescribed curvature bound. Based on the Pontryagin Maximum Principle, we apply the existing knowledge on the structure of solutions to minimum-time problems, or the Markov-Dubins problem, to reachability considerations. Based on this development, two types of reachability are discussed. First, we prove that any boundary point of the reachability set, with the directional component taken into account as well as geometric coordinates, can be reached via curves of H, CSC, CCC, or their respective subsegments, where H denotes a helicoidal arc, C a circular arc with maximum curvature, and S a straight segment. Second, we show that the reachability set when the directional component is not considered (the position reachability set) is simply a solid of revolution of its two-dimensional counterpart, the Dubins car. These findings extend the developments presented in the literature on the Dubins car to spatial curves in $\mathbb{R}^3$.
Submitted 26 March, 2025; v1 submitted 27 March, 2024;
originally announced March 2024.
-
Region-Adaptive Transform with Segmentation Prior for Image Compression
Authors:
Yuxi Liu,
Wenhan Yang,
Huihui Bai,
Yunchao Wei,
Yao Zhao
Abstract:
Learned Image Compression (LIC) has shown remarkable progress in recent years. Existing works commonly employ CNN-based or self-attention-based modules as transform methods for compression. However, there is no prior research on neural transform that focuses on specific regions. In response, we introduce the class-agnostic segmentation masks (i.e. semantic masks without category labels) for extracting region-adaptive contextual information. Our proposed module, Region-Adaptive Transform, applies adaptive convolutions on different regions guided by the masks. Additionally, we introduce a plug-and-play module named Scale Affine Layer to incorporate rich contexts from various regions. While there have been prior image compression efforts that involve segmentation masks as additional intermediate inputs, our approach differs significantly from them. Our advantages lie in that, to avoid extra bitrate overhead, we treat these masks as privilege information, which is accessible during the model training stage but not required during the inference phase. To the best of our knowledge, we are the first to employ class-agnostic masks as privilege information and achieve superior performance in pixel-fidelity metrics, such as Peak Signal to Noise Ratio (PSNR). The experimental results demonstrate our improvement compared to previously well-performing methods, with about 8.2% bitrate saving compared to VTM-17.0. The source code is available at https://github.com/GityuxiLiu/SegPIC-for-Image-Compression.
Submitted 24 September, 2024; v1 submitted 1 March, 2024;
originally announced March 2024.
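The abstract above reports gains in pixel-fidelity metrics such as Peak Signal to Noise Ratio (PSNR). For reference, PSNR is defined as 10·log10(MAX² / MSE), where MSE is the mean squared error between the original and reconstructed pixel values. The sketch below illustrates that standard definition only; the function name `psnr`, the flat-list image representation, and the 8-bit default peak value are assumptions, and nothing here reproduces the paper's compression method.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], distorted: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel values (an assumption of
    this sketch); max_value is the peak pixel value, 255 for 8-bit data.
    """
    if len(reference) != len(distorted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [52, 55, 61, 59]
    decoded = [53, 55, 60, 59]
    print(f"PSNR: {psnr(original, decoded):.2f} dB")
```

Higher values indicate a reconstruction closer to the original; codecs are typically compared by plotting PSNR against bitrate.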
-
Constraint-Aware Mesh Refinement Method by Reachability Set Envelope of Curvature Bounded Paths
Authors:
Juho Bae,
Ji Hoon Bai,
Byung-Yoon Lee,
Jun-Yong Lee
Abstract:
This paper presents an enhanced direct-method-based approach for the real-time solution of optimal control problems to handle path constraints, such as obstacles. The principal contributions of this work are twofold: first, the existing methods for constructing reachability sets in the literature are extended to derive the envelope of these sets, which determines the region swept by all feasible trajectories between adjacent sample points. Second, we propose a novel method that guarantees no constraint violations between discrete states in two dimensions through a mesh refinement approach. To illustrate the effectiveness of the proposed methodology, numerical simulations are conducted on real-time path planning for fixed-wing unmanned aerial vehicles.
Submitted 4 March, 2024; v1 submitted 25 January, 2024;
originally announced January 2024.
-
PIPO-Net: A Penalty-based Independent Parameters Optimization Deep Unfolding Network
Authors:
Xiumei Li,
Zhijie Zhang,
Huang Bai,
Ljubiša Stanković,
Junpeng Hao,
Junmei Sun
Abstract:
Compressive sensing (CS) has been widely applied in signal and image processing fields. Traditional CS reconstruction algorithms have a complete theoretical foundation but suffer from high computational complexity, while fashionable deep network-based methods can achieve high-accuracy reconstruction of CS but lack interpretability. These facts motivate us to develop a deep unfolding network named the penalty-based independent parameters optimization network (PIPO-Net) to combine the merits of the above-mentioned two kinds of CS methods. Each module of PIPO-Net can be viewed separately as an optimization problem with its respective penalty function. The main characteristic of PIPO-Net is that, in each round of training, the learnable parameters in one module are updated independently from those of other modules. This makes the network more flexible to find the optimal solutions of the corresponding problems. Moreover, the mean-subtraction sampling and the high-frequency complementary blocks are developed to improve the performance of PIPO-Net. Experiments on reconstructing CS images demonstrate the effectiveness of the proposed PIPO-Net.
Submitted 4 November, 2023;
originally announced November 2023.
-
A Sigmoid-based car-following model to improve acceleration stability in traffic oscillation and following failure in free flow
Authors:
Xingyu Chen,
Haijian Bai
Abstract:
This paper proposes an improved Intelligent driving model (Sigmoid-IDM) to address the problems of excessive acceleration in traffic oscillation and following failure in free flow. The Sigmoid-IDM uses a Sigmoid function to enhance the start-following characteristics, improve the output strategy of the spacing term, and stabilize the steady-state velocity in free flow. Moreover, the model asymmetry is improved by means of introducing cautious following distance, driving caution factor, and segmentation function. The anti-interference ability of the Sigmoid-IDM is demonstrated by local stability and string stability analyses.
Submitted 3 September, 2023;
originally announced September 2023.
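The abstract above describes modifications to the spacing term and free-flow behavior of an IDM-type car-following model but does not give the Sigmoid-IDM equations. As background, the sketch below implements only the widely used baseline Intelligent Driver Model, whose acceleration combines a free-flow term and a spacing term; it is not the authors' Sigmoid-IDM. The function name `idm_acceleration` and all default parameter values are illustrative assumptions.

```python
import math


def idm_acceleration(v: float, gap: float, dv: float,
                     v0: float = 33.3,    # desired speed [m/s] (assumed value)
                     T: float = 1.5,      # desired time headway [s] (assumed)
                     s0: float = 2.0,     # minimum standstill gap [m] (assumed)
                     a_max: float = 1.0,  # maximum acceleration [m/s^2] (assumed)
                     b: float = 2.0,      # comfortable deceleration [m/s^2] (assumed)
                     delta: float = 4.0) -> float:
    """Baseline IDM acceleration of a following vehicle.

    v   : speed of the follower [m/s]
    gap : bumper-to-bumper distance to the leader [m], must be > 0
    dv  : approach rate, follower speed minus leader speed [m/s]
    """
    if gap <= 0:
        raise ValueError("gap must be positive")

    # Desired dynamic gap; the interaction part is clipped at zero so that
    # a faster leader never produces a desired gap below s0.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))

    free_flow_term = (v / v0) ** delta  # vanishes at standstill, 1 at v0
    spacing_term = (s_star / gap) ** 2  # grows as the gap closes
    return a_max * (1.0 - free_flow_term - spacing_term)


if __name__ == "__main__":
    # Follower at 20 m/s closing on a slower leader 30 m ahead.
    print(idm_acceleration(v=20.0, gap=30.0, dv=5.0))
```

With a very large gap the spacing term vanishes and the vehicle accelerates toward `v0`; with a small gap the spacing term dominates and the output becomes strongly negative, which is the term the abstract says Sigmoid-IDM reshapes.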
-
Challenges and Opportunities for Second-life Batteries: A Review of Key Technologies and Economy
Authors:
Xubo Gu,
Hanyu Bai,
Xiaofan Cui,
Juner Zhu,
Weichao Zhuang,
Zhaojian Li,
Xiaosong Hu,
Ziyou Song
Abstract:
Due to the increasing volume of Electric Vehicles in automotive markets and the limited lifetime of onboard lithium-ion batteries (LIBs), the large-scale retirement of LIBs is imminent. The battery packs retired from Electric Vehicles still retain 70%-80% of the initial capacity, thus having the potential to be utilized in scenarios with lower energy and power requirements to maximize the value of LIBs. However, spent batteries are commonly less reliable than fresh batteries due to their degraded performance, thereby necessitating a comprehensive assessment from safety and economic perspectives before further utilization. To this end, this paper reviews the key technological and economic aspects of second-life batteries (SLBs). Firstly, we introduce various degradation models for first-life batteries and identify an opportunity to combine physics-based theories with data-driven methods to establish explainable models with physical laws that can be generalized. However, degradation models specifically tailored to SLBs are currently absent. Therefore, we analyze the applicability of existing battery degradation models developed for first-life batteries in SLB applications. Secondly, we investigate fast screening and regrouping techniques and discuss the regrouping standards for the first time to guide the classification procedure and enhance the performance and safety of SLBs. Thirdly, we scrutinize the economic analysis of SLBs and summarize the potentially profitable applications. Finally, we comprehensively examine and compare power electronics technologies that can substantially improve the performance of SLBs, including high-efficiency energy transformation technologies, active equalization technologies, and technologies to improve reliability and safety.
Submitted 13 August, 2023;
originally announced August 2023.
-
You Can Mask More For Extremely Low-Bitrate Image Compression
Authors:
Anqi Li,
Feng Li,
Jiaxin Han,
Huihui Bai,
Runmin Cong,
Chunjie Zhang,
Meng Wang,
Weisi Lin,
Yao Zhao
Abstract:
Learned image compression (LIC) methods have experienced significant progress during recent years. However, these methods are primarily dedicated to optimizing the rate-distortion (R-D) performance at medium and high bitrates (> 0.1 bits per pixel (bpp)), while research on extremely low bitrates is limited. Besides, existing methods fail to explicitly explore the image structure and texture components crucial for image compression, treating them equally alongside uninformative components in networks. This can cause severe perceptual quality degradation, especially under low-bitrate scenarios. In this work, inspired by the success of pre-trained masked autoencoders (MAE) in many downstream tasks, we propose to rethink its mask sampling strategy from structure and texture perspectives for high redundancy reduction and discriminative feature representation, further unleashing the potential of LIC methods. Therefore, we present a dual-adaptive masking approach (DA-Mask) that samples visible patches based on the structure and texture distributions of original images. We combine DA-Mask and pre-trained MAE in masked image modeling (MIM) as an initial compressor that abstracts informative semantic context and texture representations. Such a pipeline can well cooperate with LIC networks to achieve further secondary compression while preserving promising reconstruction quality. Consequently, we propose a simple yet effective masked compression model (MCM), the first framework that unifies MIM and LIC end-to-end for extremely low-bitrate image compression. Extensive experiments have demonstrated that our approach outperforms recent state-of-the-art methods in R-D performance, visual quality, and downstream applications, at very low bitrates. Our code is available at https://github.com/lianqi1008/MCM.git.
Submitted 27 June, 2023;
originally announced June 2023.
-
Exploring Resolution Fields for Scalable Image Compression with Uncertainty Guidance
Authors:
Dongyi Zhang,
Feng Li,
Man Liu,
Runmin Cong,
Huihui Bai,
Meng Wang,
Yao Zhao
Abstract:
Recently, there have been significant advancements in learning-based image compression methods surpassing traditional coding standards. Most of them prioritize achieving the best rate-distortion performance for a particular compression rate, which limits their flexibility and adaptability in various applications with complex and varying constraints. In this work, we explore the potential of resolution fields in scalable image compression and propose the reciprocal pyramid network (RPN) that fulfills the need for more adaptable and versatile compression. Specifically, RPN first builds a compression pyramid and generates the resolution fields at different levels in a top-down manner. The key design lies in the cross-resolution context mining module between adjacent levels, which performs feature enriching and distillation to mine meaningful contextualized information and remove unnecessary redundancy, producing informative resolution fields as residual priors. The scalability is achieved by progressive bitstream reusing and resolution field incorporation varying at different levels. Furthermore, between adjacent compression levels, we explicitly quantify the aleatoric uncertainty from the bottom decoded representations and develop an uncertainty-guided loss to update the upper-level compression parameters, forming a reverse pyramid process that forces the network to focus on the textured pixels with high variance for more reliable and accurate reconstruction. Combining resolution field exploration and uncertainty guidance in a pyramid manner, RPN can effectively achieve spatial and quality scalable image compression. Experiments show the superiority of RPN against existing classical and deep learning-based scalable codecs. Code will be available at https://github.com/JGIroro/RPNSIC.
Submitted 15 June, 2023;
originally announced June 2023.
-
Reinforcement Learning-based Control of Nonlinear Systems using Carleman Approximation: Structured and Unstructured Designs
Authors:
Jishnudeep Kar,
He Bai,
Aranya Chakrabortty
Abstract:
We develop data-driven reinforcement learning (RL) control designs for input-affine nonlinear systems. We use Carleman linearization to express the state-space representation of the nonlinear dynamical model in the Carleman space, and develop a real-time algorithm that can learn nonlinear state-feedback controllers using state and input measurements in the infinite-dimensional Carleman space. Thereafter, we study the practicality of having a finite-order truncation of the control signal, followed by its closed-loop stability analysis. Finally, we develop two additional designs that can learn structured as well as sparse representations of the RL-based nonlinear controller, and provide theoretical conditions for ensuring their closed-loop stability. We present numerical examples to show how our proposed method generates closed-loop responses that are close to the optimal performance of the nonlinear plant. We also compare our designs to other data-driven nonlinear RL control methods such as those based on neural networks, and illustrate their relative advantages and drawbacks.
Submitted 7 August, 2024; v1 submitted 21 February, 2023;
originally announced February 2023.
-
ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech
Authors:
Xiaoran Fan,
Chao Pang,
Tian Yuan,
He Bai,
Renjie Zheng,
Pengfei Zhu,
Shuohuan Wang,
Junkun Chen,
Zeyu Chen,
Liang Huang,
Yu Sun,
Hua Wu
Abstract:
Speech representation learning has improved both speech understanding and speech synthesis tasks for a single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes given a speech example and its transcription. By learning to reconstruct the masked parts of the input in different languages, our model shows great improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both the training and the inference without any finetuning effort. In cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing tasks, our experiments show that our model outperforms speaker-embedding-based multi-speaker TTS methods.
Submitted 4 December, 2022; v1 submitted 7 November, 2022;
originally announced November 2022.
-
AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation
Authors:
Yuanfeng Ji,
Haotian Bai,
Jie Yang,
Chongjian Ge,
Ye Zhu,
Ruimao Zhang,
Zhen Li,
Lingyan Zhang,
Wanling Ma,
Xiang Wan,
Ping Luo
Abstract:
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.
Submitted 1 September, 2022; v1 submitted 16 June, 2022;
originally announced June 2022.
-
GoAutoBash: Golang-based Multi-Thread Automatic Pull-Execute Framework with GitHub Webhooks And Queuing Strategy
Authors:
Hao Bai
Abstract:
Recently, more and more server tasks are done using full automation, including grading tasks for students in college courses, integration tasks for programmers in big projects and server-based transactions, and visualization tasks for researchers in data-dense topics. Automation on servers offers great potential for reducing the burden of manual tasks. Although server tools such as CI/CD for continuous integration and Hexo for automated blog deployment have been developed, they are highly dedicated to certain functionalities and thus lack general applicability. In this paper, we introduce a Golang-based automation framework that reacts to events happening on GitHub using a multi-threaded approach. This framework uses a queue to arrange the submitted tasks and executes each task with a thread in a preemptive manner. We then use the project GoAutoGrader to illustrate a specific implementation of this framework and its value in implementing high-freedom server applications. Because Golang is developing rapidly, owing to its parallel programming efficiency and its ease of learning for those familiar with C-like programming languages, we decided to develop this system in Golang.
Submitted 13 June, 2022;
originally announced June 2022.
-
A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing
Authors:
He Bai,
Renjie Zheng,
Junkun Chen,
Xintong Li,
Mingbo Ma,
Liang Huang
Abstract:
Recently, speech representation learning has improved many speech-related tasks such as speech recognition, speech classification, and speech-to-text translation. However, all of the above tasks are in the direction of speech understanding; for the inverse direction, speech synthesis, the potential of representation learning is yet to be realized, due to the challenging nature of generating high-quality speech. To address this problem, we propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training. In this way, the pretrained model can generate high-quality reconstructed spectrograms, which can be applied directly to speech editing and unseen-speaker TTS. Experiments show A$^3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
Submitted 18 June, 2022; v1 submitted 17 March, 2022;
originally announced March 2022.
-
Distributed Cooperative Multi-Agent Reinforcement Learning with Directed Coordination Graph
Authors:
Gangshan Jing,
He Bai,
Jemin George,
Aranya Chakrabortty,
Piyush K. Sharma
Abstract:
Existing distributed cooperative multi-agent reinforcement learning (MARL) frameworks usually assume undirected coordination graphs and communication graphs while estimating a global reward via consensus algorithms for policy evaluation. Such a framework may induce expensive communication costs and exhibit poor scalability due to the requirement of global consensus. In this work, we study MARL with directed coordination graphs, and propose a distributed RL algorithm where the local policy evaluations are based on local value functions. The local value function of each agent is obtained by local communication with its neighbors through a directed learning-induced communication graph, without using any consensus algorithm. A zeroth-order optimization (ZOO) approach based on parameter perturbation is employed to achieve gradient estimation. By comparing with existing ZOO-based RL algorithms, we show that our proposed distributed RL algorithm guarantees high scalability. A distributed resource allocation example is shown to illustrate the effectiveness of our algorithm.
Submitted 9 January, 2022;
originally announced January 2022.
-
Super-resolution reconstruction of cytoskeleton image based on A-net deep learning network
Authors:
Qian Chen,
Haoxin Bai,
Bingchen Che,
Tianyun Zhao,
Ce Zhang,
Kaige Wang,
Jintao Bai,
Wei Zhao
Abstract:
To date, live-cell imaging at the nanometer scale remains challenging. Even though super-resolution microscopy methods have enabled visualization of subcellular structures below the optical resolution limit, the spatial resolution is still far from enough for the structural reconstruction of biomolecules in vivo (e.g., the ~24 nm thickness of a microtubule fiber). In this study, we proposed an A-net network and showed that the resolution of cytoskeleton images captured by a confocal microscope can be significantly improved by combining the A-net deep learning network with the DWDC algorithm based on a degradation model. Utilizing the DWDC algorithm to construct new datasets and taking advantage of the A-net neural network's features (i.e., considerably fewer layers), we successfully removed the noise and flocculent structures, which originally interfere with the cellular structure in the raw image, and improved the spatial resolution by 10 times using a relatively small dataset. We therefore conclude that the proposed algorithm, which combines the A-net neural network with the DWDC method, is a suitable and universal approach for extracting structural details of biomolecules, cells and organs from low-resolution images.
Submitted 17 December, 2021;
originally announced December 2021.
-
Asynchronous Distributed Reinforcement Learning for LQR Control via Zeroth-Order Block Coordinate Descent
Authors:
Gangshan Jing,
He Bai,
Jemin George,
Aranya Chakrabortty,
Piyush K. Sharma
Abstract:
Recently introduced distributed zeroth-order optimization (ZOO) algorithms have shown their utility in distributed reinforcement learning (RL). Unfortunately, in the gradient estimation process, almost all of them require random samples with the same dimension as the global variable and/or require evaluation of the global cost function, which may induce high estimation variance for large-scale networks. In this paper, we propose a novel distributed zeroth-order algorithm by leveraging the network structure inherent in the optimization objective, which allows each agent to estimate its local gradient by local cost evaluation independently, without use of any consensus protocol. The proposed algorithm exhibits an asynchronous update scheme, and is designed for stochastic non-convex optimization with a possibly non-convex feasible domain based on the block coordinate descent method. The algorithm is later employed as a distributed model-free RL algorithm for distributed linear quadratic regulator design, where a learning graph is designed to describe the required interaction relationship among agents in distributed learning. We provide an empirical validation of the proposed algorithm to benchmark its performance on convergence rate and variance against a centralized ZOO algorithm.
Submitted 2 May, 2024; v1 submitted 26 July, 2021;
originally announced July 2021.
-
Variance Reduction of Quadcopter Trajectory Tracking in Turbulent Wind
Authors:
Asma Tabassum,
Rohit K. S. S. Vuppala,
He Bai,
Kursat Kara
Abstract:
We consider a quadcopter operating in a turbulent windy environment. The turbulent environment may be imposed on a quadcopter by structures, landscapes, terrains and, most importantly, by the unique physical phenomena in the lower atmosphere. Turbulence can negatively impact a quadcopter's performance and operations. Modeling turbulence as a stochastic random input, we investigate control designs that can reduce the turbulence effects on the quadcopter's motion. In particular, we design a minimum cost variance (MCV) controller aiming to minimize the cost in terms of its weighted sum of mean and variance. We linearize the quadcopter dynamics and examine the MCV controller derived from a set of coupled algebraic Riccati equations (CARE) with full-state feedback. Our preliminary simulation results show a reduction in variance and in mean trajectory tracking error compared to a traditional linear quadratic regulator (LQR).
Submitted 25 August, 2021; v1 submitted 20 April, 2021;
originally announced April 2021.
-
Dynamic Control Allocation between Onboard and Delayed Remote Control for Unmanned Aircraft System Detect-and-Avoid
Authors:
Asma Tabassum,
He Bai
Abstract:
This paper develops and evaluates the performance of an allocation agent to be potentially integrated into the onboard Detect and Avoid (DAA) computer of an Unmanned Aircraft System (UAS). We consider a UAS that can be fully controlled by the onboard DAA system and by a remote human pilot. With a communication channel prone to latency, we consider a mixed-initiative interaction environment, where the control authority of the UAS is dynamically allocated by the allocation agent. In an encounter with a dynamic intruder, the probability of collision may increase when latency leaves the UAS without pilot commands. Moreover, a delayed pilot command may not result in safe resolution of the current scenario and may need to be refined. We design an optimization algorithm to reduce collision risk and refine delayed pilot commands. Towards this end, a Markov Decision Process (MDP) and its solution are employed to create a wait time map. The map consists of estimated times that the UAS can wait for the remote pilot commands at each state. A command blending algorithm is designed to select an avoidance maneuver that prioritizes the pilot intention extracted from the pilot commands. The wait time map and the command blending algorithm are implemented and integrated into a closed-loop simulator. We conduct ten thousand fast-time Monte Carlo simulations and compare the performance of the integrated setup with a standalone DAA setup. The simulation results show that the allocation agent enables the UAS to wait without inducing any near mid-air collision (NMAC) or severe loss of well clear (LoWC) while improving pilot involvement in the encounter resolution.
Submitted 13 March, 2021;
originally announced March 2021.
-
Learning Distributed Stabilizing Controllers for Multi-Agent Systems
Authors:
Gangshan Jing,
He Bai,
Jemin George,
Aranya Chakrabortty,
Piyush K. Sharma
Abstract:
We address the problem of model-free distributed stabilization of heterogeneous multi-agent systems using reinforcement learning (RL). Two algorithms are developed. The first algorithm solves a centralized linear quadratic regulator (LQR) problem without knowing any initial stabilizing gain in advance. The second algorithm builds upon the results of the first algorithm, and extends it to distributed stabilization of multi-agent systems with predefined interaction graphs. Rigorous proofs are provided to show that the proposed algorithms achieve guaranteed convergence if specific conditions hold. A simulation example is presented to demonstrate the theoretical results.
Submitted 7 March, 2021;
originally announced March 2021.
-
Exploration of Whether Skylight Polarization Patterns Contain Three-dimensional Attitude Information
Authors:
Huaju Liang,
Hongyang Bai,
Tong Zhou
Abstract:
Our previous work has demonstrated that the Rayleigh model, which is widely used in polarized skylight navigation to describe skylight polarization patterns, does not contain three-dimensional (3D) attitude information [1]. However, it is still necessary to further explore whether the skylight polarization patterns contain 3D attitude information. In this paper, a social spider optimization (SSO) method is therefore proposed to estimate three Euler angles; it considers the difference of each pixel among polarization images based on template matching (TM) to make full use of the captured polarization information. In addition, to explore this problem, we use not only angle of polarization (AOP) and degree of polarization (DOP) information, but also light intensity (LI) information. To this end, a sky model is established that combines the Berry model and the Hosek model to fully describe AOP, DOP, and LI information in the sky, and considers the influence of four neutral points, ground albedo, atmospheric turbidity, and wavelength. The simulation results show that the SSO algorithm can estimate 3D attitude and that the established sky model contains 3D attitude information. However, when there is measurement noise or model error, the accuracy of 3D attitude estimation drops significantly. In field experiments especially, it is very difficult to estimate 3D attitude. Finally, the results are discussed in detail.
Submitted 30 November, 2020;
originally announced December 2020.
-
Online Observer-Based Inverse Reinforcement Learning
Authors:
Ryan Self,
Kevin Coleman,
He Bai,
Rushikesh Kamalapurkar
Abstract:
In this paper, a novel approach to the output-feedback inverse reinforcement learning (IRL) problem is developed by casting the IRL problem, for linear systems with quadratic cost functions, as a state estimation problem. Two observer-based techniques for IRL are developed, including a novel observer method that re-uses previous state estimates via history stacks. Theoretical guarantees for convergence and robustness are established under appropriate excitation conditions. Simulations demonstrate the performance of the developed observers and filters under noisy and noise-free measurements.
Submitted 17 July, 2023; v1 submitted 3 November, 2020;
originally announced November 2020.
-
Decomposability and Parallel Computation of Multi-Agent LQR
Authors:
Gangshan Jing,
He Bai,
Jemin George,
Aranya Chakrabortty
Abstract:
Individual agents in a multi-agent system (MAS) may have decoupled open-loop dynamics, but a cooperative control objective usually results in coupled closed-loop dynamics, thereby making the control design computationally expensive. The computation time becomes even higher when a learning strategy such as reinforcement learning (RL) needs to be applied to deal with the situation when the agents' dynamics are not known. To resolve this problem, we propose a parallel RL scheme for a linear quadratic regulator (LQR) design in a continuous-time linear MAS. The idea is to exploit the structural properties of two graphs embedded in the $Q$ and $R$ weighting matrices in the LQR objective to define an orthogonal transformation that can convert the original LQR design to multiple decoupled smaller-sized LQR designs. We show that if the MAS is homogeneous then this decomposition retains closed-loop optimality. Conditions for decomposability, an algorithm for constructing the transformation matrix, a parallel RL algorithm, and robustness analysis when the design is applied to non-homogeneous MAS are presented. Simulations show that the proposed approach can guarantee significant speed-up in learning without any loss in the cumulative value of the LQR cost.
Submitted 7 March, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.
-
Model-Free Optimal Control of Linear Multi-Agent Systems via Decomposition and Hierarchical Approximation
Authors:
Gangshan Jing,
He Bai,
Jemin George,
Aranya Chakrabortty
Abstract:
Designing the optimal linear quadratic regulator (LQR) for a large-scale multi-agent system (MAS) is time-consuming since it involves solving a large-size matrix Riccati equation. The situation is further exacerbated when the design needs to be done in a model-free way using schemes such as reinforcement learning (RL). To reduce this computational complexity, we decompose the large-scale LQR design problem into multiple smaller-size LQR design problems. We consider the objective function to be specified over an undirected graph, and cast the decomposition as a graph clustering problem. The graph is decomposed into two parts, one consisting of independent clusters of connected components, and the other containing edges that connect different clusters. Accordingly, the resulting controller has a hierarchical structure, consisting of two components. The first component optimizes the performance of each independent cluster by solving the smaller-size LQR design problem in a model-free way using an RL algorithm. The second component accounts for the objective coupling different clusters, which is achieved by solving a least squares problem in one shot. Although suboptimal, the hierarchical controller adheres to a particular structure as specified by inter-agent couplings in the objective function and by the decomposition strategy. Mathematical formulations are established to find a decomposition that minimizes the number of required communication links or reduces the optimality gap. Numerical simulations are provided to highlight the pros and cons of the proposed designs.
Submitted 16 March, 2021; v1 submitted 14 August, 2020;
originally announced August 2020.
-
Hierarchical Control of Multi-Agent Systems using Online Reinforcement Learning
Authors:
He Bai,
Jemin George,
Aranya Chakrabortty
Abstract:
We propose a new reinforcement-learning-based approach to designing hierarchical linear quadratic regulator (LQR) controllers for heterogeneous linear multi-agent systems with unknown state-space models and separated control objectives. The separation arises from grouping the agents into multiple non-overlapping groups, and defining the control goal as two distinct objectives. The first objective aims to minimize a group-wise block-decentralized LQR function that models the group-level mission. The second objective, on the other hand, tries to minimize an LQR function between the average states (centroids) of the groups. Exploiting this separation, we redefine the weighting matrices of the LQR functions in a way that allows us to decouple their respective algebraic Riccati equations. Thereafter, we develop a reinforcement learning strategy that uses online measurements of the agent states and the average states to learn the respective controllers based on the approximate Riccati equations. Since the first controller is block-decentralized and can therefore be learned in parallel, while the second controller is reduced-dimensional due to averaging, the overall design enjoys a significantly reduced learning time compared to centralized reinforcement learning.
Submitted 28 July, 2020;
originally announced July 2020.
-
Emulating UAV Motion by Utilizing Robotic Arm for mmWave Wireless Channel Characterization
Authors:
Amit Kachroo,
Collin A. Thornton,
Md Arifur Rahman Sarker,
Wooyeol Choi,
He Bai,
Ickhyun Song,
John O'Hara,
Sabit Ekin
Abstract:
In this paper, millimeter wave (mmWave) wireless channel characteristics (Doppler spread and path loss modeling) for Unmanned Aerial Vehicle (UAV) assisted communication are analyzed and studied by emulating the real UAV motion using a robotic arm. The motion considers the actual turbulence caused by wind gusts acting on the UAV in the atmosphere, which is statistically modeled by the widely used Dryden wind model. The frequency under consideration is 28 GHz in an anechoic chamber setting. A total of 11 distance points from 3.5 feet to 23.5 feet in increments of 2 feet were considered in this experiment. At each distance point, 3 samples of data were collected for better inference purposes. In this emulated environment, it was found that the average Doppler spread at these different distances was around -20 Hz and +20 Hz at the noise floor of -60 dB. On the other hand, the path loss exponent was found to be 1.843. This study lays out a novel framework for emulating UAV motion for mmWave communication systems, which will pave the way for future design and implementation of next-generation UAV-assisted wireless communication systems.
Submitted 21 March, 2021; v1 submitted 20 June, 2020;
originally announced June 2020.
-
Reduced-Dimensional Reinforcement Learning Control using Singular Perturbation Approximations
Authors:
Sayak Mukherjee,
He Bai,
Aranya Chakrabortty
Abstract:
We present a set of model-free, reduced-dimensional reinforcement learning (RL) based optimal control designs for linear time-invariant singularly perturbed (SP) systems. We first present a state-feedback and output-feedback based RL control design for a generic SP system with unknown state and input matrices. We take advantage of the underlying time-scale separation property of the plant to learn a linear quadratic regulator (LQR) for only its slow dynamics, thereby saving a significant amount of learning time compared to the conventional full-dimensional RL controller. We analyze the sub-optimality of the design using SP approximation theorems and provide sufficient conditions for closed-loop stability. Thereafter, we extend both designs to clustered multi-agent consensus networks, where the SP property reflects through clustering. We develop both centralized and cluster-wise block-decentralized RL controllers for such networks, in reduced dimensions. We demonstrate the details of the implementation of these controllers using simulations of relevant numerical examples and compare them with conventional RL designs to show the computational benefits of our approach.
Submitted 29 April, 2020;
originally announced April 2020.
-
Deep Optimized Multiple Description Image Coding via Scalar Quantization Learning
Authors:
Lijun Zhao,
Huihui Bai,
Anhong Wang,
Yao Zhao
Abstract:
In this paper, we introduce a deep multiple description coding (MDC) framework optimized by minimizing multiple description (MD) compressive loss. First, an MD multi-scale-dilated encoder network generates multiple description tensors, which are discretized by scalar quantizers, while these quantized tensors are decompressed by MD cascaded-ResBlock decoder networks. To greatly reduce the total number of artificial neural network parameters, an autoencoder network composed of these two types of network is designed as a symmetrical parameter-sharing structure. Second, this autoencoder network and a pair of scalar quantizers are simultaneously learned in an end-to-end self-supervised way. Third, considering the variation in the image spatial distribution, each scalar quantizer is accompanied by an importance-indicator map to generate MD tensors, rather than using direct quantization. Fourth, we introduce the multiple description structural similarity distance loss, which implicitly regularizes the diversified multiple description generations, to explicitly supervise multiple description diversified decoding in addition to MD reconstruction loss. Finally, we demonstrate that our MDC framework performs better than several state-of-the-art MDC approaches regarding image coding efficiency when tested on several commonly available datasets.
Submitted 12 January, 2020;
originally announced January 2020.
-
RTN: Reparameterized Ternary Network
Authors:
Yuhang Li,
Xin Dong,
Sai Qian Zhang,
Haoli Bai,
Yuanpeng Chen,
Wei Wang
Abstract:
To deploy deep neural networks on resource-limited devices, quantization has been widely explored. In this work, we study extremely low-bit networks, which offer tremendous speed-up and memory savings through quantized activations and weights. We first raise three overlooked issues in extremely low-bit networks: the squashing range of quantized values; vanishing gradients during backpropagation; and the unexploited hardware acceleration of ternary networks. By reparameterizing the quantized activation and weight vectors with a full-precision scale and offset for a fixed ternary vector, we decouple the range and magnitude from the direction to mitigate the three issues. The learnable scale and offset can automatically adjust the range of quantized values and the sparsity without gradient vanishing. A novel encoding and computation pattern is designed to support efficient computing for our reparameterized ternary network (RTN). Experiments on ResNet-18 for ImageNet demonstrate that the proposed RTN achieves a much better trade-off between bitwidth and accuracy, with up to 26.76% relative accuracy improvement compared with state-of-the-art methods. Moreover, we validate the proposed computation pattern on Field Programmable Gate Arrays (FPGA), where it brings 46.46x and 89.17x savings in power and area, respectively, compared with full-precision convolution.
Submitted 12 December, 2019; v1 submitted 4 December, 2019;
originally announced December 2019.
-
Compressed Sensing with Probability-based Prior Information
Authors:
Q. Jiang,
S. Li,
Z. Zhu,
H. Bai,
X. He,
R. C. de Lamare
Abstract:
This paper deals with the design of a sensing matrix along with a sparse recovery algorithm by utilizing probability-based prior information for a compressed sensing (CS) system. With knowledge of the probability of each atom of the dictionary being used, a diagonal weighted matrix is obtained, and the sensing matrix is then designed by minimizing a weighted function such that the Gram of the equivalent dictionary is as close to the Gram of the dictionary as possible. An analytical solution for the corresponding sensing matrix is derived, which leads to low computational complexity. We also exploit this prior information in the sparse recovery stage and propose a probability-driven orthogonal matching pursuit algorithm that improves the accuracy of the recovery. Simulations on synthetic data and on surveillance video application scenarios are carried out to compare the performance of the proposed methods with some existing algorithms. The results reveal that the proposed CS system outperforms existing CS systems.
Submitted 27 October, 2019;
originally announced October 2019.