-
Single-Channel Distance-Based Source Separation for Mobile GPU in Outdoor and Indoor Environments
Authors:
Hanbin Bae,
Byungjun Kang,
Jiwon Kim,
Jaeyong Hwang,
Hosang Sung,
Hoon-Young Cho
Abstract:
This study emphasizes the significance of exploring distance-based source separation (DSS) in outdoor environments. Unlike existing studies that primarily focus on indoor settings, the proposed model is designed to capture the unique characteristics of outdoor audio sources. It incorporates advanced techniques, including a two-stage conformer block, a linear relation-aware self-attention (RSA), and a TensorFlow Lite GPU delegate. While the linear RSA may not capture physical cues as explicitly as the quadratic RSA, the linear RSA enhances the model's context awareness, leading to improved performance on the DSS that requires an understanding of physical cues in outdoor and indoor environments. The experimental results demonstrated that the proposed model overcomes the limitations of existing approaches and considerably enhances energy efficiency and real-time inference speed on mobile devices.
Submitted 6 January, 2025;
originally announced January 2025.
-
GPT-4o System Card
Authors:
OpenAI,
Aaron Hurst,
Adam Lerer,
Adam P. Goucher,
Adam Perelman,
Aditya Ramesh,
Aidan Clark,
AJ Ostrow,
Akila Welihinda,
Alan Hayes,
Alec Radford,
Aleksander Mądry,
Alex Baker-Whitcomb,
Alex Beutel,
Alex Borzunov,
Alex Carney,
Alex Chow,
Alex Kirillov,
Alex Nichol,
Alex Paino,
Alex Renzin,
Alex Tachard Passos,
Alexander Kirillov,
Alexi Christakis
, et al. (395 additional authors not shown)
Abstract:
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
Submitted 25 October, 2024;
originally announced October 2024.
-
FINALLY: fast and universal speech enhancement with studio-like quality
Authors:
Nicholas Babaev,
Kirill Tamogashev,
Azat Saginbaev,
Ivan Shchekotov,
Hanbin Bae,
Hosang Sung,
WonJun Lee,
Hoon-Young Cho,
Pavel Andreev
Abstract:
In this paper, we address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion, such as background noise, reverberation, and microphone artifacts. We revisit the use of Generative Adversarial Networks (GANs) for speech enhancement and theoretically show that GANs are naturally inclined to seek the point of maximum density within the conditional clean speech distribution, which, as we argue, is essential for the speech enhancement task. We study various feature extractors for perceptual loss to facilitate the stability of adversarial training, developing a methodology for probing the structure of the feature space. This leads us to integrate WavLM-based perceptual loss into MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model. The resulting speech enhancement model, which we refer to as FINALLY, builds upon the HiFi++ architecture, augmented with a WavLM encoder and a novel training pipeline. Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement. Demo page: https://samsunglabs.github.io/FINALLY-page
Submitted 31 October, 2024; v1 submitted 8 October, 2024;
originally announced October 2024.
-
Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds
Authors:
Hanbin Bae,
Pavel Andreev,
Azat Saginbaev,
Nicholas Babaev,
Won-Jun Lee,
Hosang Sung,
Hoon-Young Cho
Abstract:
This paper introduces a speech enhancement solution tailored for true wireless stereo (TWS) earbuds on-device usage. The solution was specifically designed to support conversations in noisy environments, with active noise cancellation (ANC) activated. The primary challenges for speech enhancement models in this context arise from computational complexity that limits on-device usage and latency that must be less than 3 ms to preserve a live conversation. To address these issues, we evaluated several crucial design elements, including the network architecture and domain, design of loss functions, pruning method, and hardware-specific optimization. Consequently, we demonstrated substantial improvements in speech enhancement quality compared with that in baseline models, while simultaneously reducing the computational complexity and algorithmic latency.
Submitted 27 September, 2024;
originally announced September 2024.
-
Energy Consumption of Plant Factory with Artificial Light: Challenges and Opportunities
Authors:
Wenyi Cai,
Kunlang Bu,
Lingyan Zha,
Jingjin Zhang,
Dayi Lai,
Hua Bao
Abstract:
Plant factory with artificial light (PFAL) is a promising technology for relieving the food crisis, especially in urban areas or arid regions endowed with abundant resources. However, the lighting and HVAC (heating, ventilation, and air conditioning) systems of PFAL consume much more energy than open-field and greenhouse farming, limiting wider application of PFAL. Recent research pays much more attention to optimizing energy consumption in order to develop and promote PFAL technology with reduced energy usage. This work comprehensively summarizes the current energy-saving methods for lighting and HVAC systems, as well as their coupling methods, for a more energy-efficient PFAL. In addition, we offer our perspectives on further energy-saving strategies and on exploiting renewable energy resources for PFAL to respond to the urgent need for energy-efficient production.
Submitted 11 May, 2024;
originally announced May 2024.
-
Neural Rendering and Its Hardware Acceleration: A Review
Authors:
Xinkai Yan,
Jieting Xu,
Yuchi Huo,
Hujun Bao
Abstract:
Neural rendering is a new image and video generation method based on deep learning. It combines the deep learning model with the physical knowledge of computer graphics, to obtain a controllable and realistic scene model, and realize the control of scene attributes such as lighting, camera parameters, posture and so on. On the one hand, neural rendering can not only make full use of the advantages of deep learning to accelerate the traditional forward rendering process, but also provide new solutions for specific tasks such as inverse rendering and 3D reconstruction. On the other hand, the design of innovative hardware structures that adapt to the neural rendering pipeline breaks through the parallel computing and power consumption bottleneck of existing graphics processors, which is expected to provide important support for future key areas such as virtual and augmented reality, film and television creation and digital entertainment, artificial intelligence and the metaverse. In this paper, we review the technical connotation, main challenges, and research progress of neural rendering. On this basis, we analyze the common requirements of neural rendering pipeline for hardware acceleration and the characteristics of the current hardware acceleration architecture, and then discuss the design challenges of neural rendering processor architecture. Finally, the future development trend of neural rendering processor architecture is prospected.
Submitted 6 January, 2024;
originally announced February 2024.
-
PUCA: Patch-Unshuffle and Channel Attention for Enhanced Self-Supervised Image Denoising
Authors:
Hyemi Jang,
Junsung Park,
Dahuin Jung,
Jaihyun Lew,
Ho Bae,
Sungroh Yoon
Abstract:
Although supervised image denoising networks have shown remarkable performance on synthesized noisy images, they often fail in practice due to the difference between real and synthesized noise. Since clean-noisy image pairs from the real world are extremely costly to gather, self-supervised learning, which utilizes the noisy input itself as a target, has been studied. To prevent a self-supervised denoising model from learning an identity mapping, each output pixel should not be influenced by its corresponding input pixel. This requirement is known as J-invariance. Blind-spot networks (BSNs) have been a prevalent choice to ensure J-invariance in self-supervised image denoising. However, constructing variations of BSNs by injecting additional operations such as downsampling can expose blinded information, thereby violating J-invariance. Consequently, only convolutions designed specifically for BSNs have been allowed, limiting architectural flexibility. To overcome this limitation, we propose PUCA, a novel J-invariant U-Net architecture, for self-supervised denoising. PUCA leverages patch-unshuffle/shuffle to dramatically expand receptive fields while maintaining J-invariance and dilated attention blocks (DABs) for global context incorporation. Experimental results demonstrate that PUCA achieves state-of-the-art performance, outperforming existing methods in self-supervised image denoising.
Submitted 16 October, 2023;
originally announced October 2023.
-
Energy Efficient Operation of Adaptive Massive MIMO 5G HetNets
Authors:
Siddarth Marwaha,
Eduard A. Jorswieck,
Mostafa Jassim,
Thomas Kuerner,
David Lopez Perez,
Xilnli Geng,
Harvey Bao
Abstract:
For energy efficient operation of the massive multiple-input multiple-output (MIMO) networks, various aspects of energy efficiency maximization have been addressed, where a careful selection of number of active antennas has shown significant gains. Moreover, switching-off physical resource blocks (PRBs) and carrier shutdown saves energy in low load scenarios. However, the joint optimization of spectral PRB allocation and spatial layering in a heterogeneous network has not been completely solved yet. Therefore, we study a power consumption model for multi-cell multi-user massive MIMO 5G network, capturing the joint effects of both dimensions. We characterize the optimal resource allocation under practical constraints, i.e., limited number of available antennas, PRBs, base stations (BSs), and frequency bands. We observe a single spatial layer achieving lowest energy consumption in very low load scenarios, whereas, spatial layering is required in high load scenarios. Finally, we derive novel algorithms for energy efficient user to BS assignment and propose an adaptive algorithm for PRB assignment and power control. All results are illustrated by numerical system-level simulations, describing a realistic metropolis scenario. The results show that a higher frequency band should be used to support users with large rate requirements via spatial multiplexing and assigning each user maximum available PRBs.
Submitted 10 October, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Correlation recurrent units: A novel neural architecture for improving the predictive performance of time-series data
Authors:
Sunghyun Sim,
Dohee Kim,
Hyerim Bae
Abstract:
The time-series forecasting (TSF) problem is a traditional problem in the field of artificial intelligence. Models such as the recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU) have contributed to improving the predictive accuracy of TSF. Furthermore, model structures have been proposed that combine time-series decomposition methods, such as seasonal-trend decomposition using Loess (STL), to ensure improved predictive accuracy. However, because this approach trains an independent model for each component, it cannot learn the relationships between time-series components. In this study, we propose a new neural architecture called a correlation recurrent unit (CRU) that can perform time-series decomposition within a neural cell and learn correlations (autocorrelation and correlation) between each decomposition component. The proposed neural architecture was evaluated through comparative experiments with previous studies using five univariate time-series datasets and four multivariate time-series datasets. The results showed that long- and short-term predictive performance was improved by more than 10%. The experimental results show that the proposed CRU is an excellent method for TSF problems compared to other neural architectures.
Submitted 28 August, 2024; v1 submitted 29 November, 2022;
originally announced November 2022.
-
Avocodo: Generative Adversarial Network for Artifact-free Vocoder
Authors:
Taejun Bak,
Junmo Lee,
Hanbin Bae,
Jinhyeok Yang,
Jae-Sung Bae,
Young-Sun Joo
Abstract:
Neural vocoders based on the generative adversarial neural network (GAN) have been widely used due to their fast inference speed and lightweight networks while generating high-quality speech waveforms. Since the perceptually important speech components are primarily concentrated in the low-frequency bands, most GAN-based vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we discovered that the multi-scale analysis which focuses on the low-frequency bands causes unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the synthesized speech waveform quality. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based vocoders and propose a GAN-based vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate speech waveforms in various perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band speech waveforms while avoiding aliasing. According to experimental results, Avocodo outperforms baseline GAN-based vocoders, both objectively and subjectively, while reproducing speech with fewer artifacts.
Submitted 3 January, 2023; v1 submitted 27 June, 2022;
originally announced June 2022.
-
An Application of a Modified Beta Factor Method for the Analysis of Software Common Cause Failures
Authors:
Tate Shorthill,
Han Bao,
Edward Chen,
Heng Ban
Abstract:
This paper presents an approach for modeling software common cause failures (CCFs) within digital instrumentation and control (I&C) systems. CCFs consist of a concurrent failure between two or more components due to a shared failure cause and coupling mechanism. This work emphasizes the importance of identifying software-centric attributes related to the coupling mechanisms necessary for simultaneous failures of redundant software components. The groups of components that share coupling mechanisms are called common cause component groups (CCCGs). Most CCF models rely on operational data as the basis for establishing CCCG parameters and predicting CCFs. This work is motivated by two primary concerns: (1) a lack of operational and CCF data for estimating software CCF model parameters; and (2) the need to model single components as part of multiple CCCGs simultaneously. A hybrid approach was developed to account for these concerns by leveraging existing techniques: a modified beta factor model allows single components to be placed within multiple CCCGs, while a second technique provides software-specific model parameters for each CCCG. This hybrid approach provides a means to overcome the limitations of conventional methods while offering support for design decisions under the limited data scenario.
Submitted 22 June, 2022;
originally announced June 2022.
-
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch
Authors:
Hanbin Bae,
Young-Sun Joo
Abstract:
The recently developed pitch-controllable text-to-speech (TTS) model, i.e. FastPitch, was conditioned for the pitch contours. However, the quality of the synthesized speech degraded considerably for pitch values that deviated significantly from the average pitch; i.e. the ability to control pitch was limited. To address this issue, we propose two algorithms to improve the robustness of FastPitch. First, we propose a novel timbre-preserving pitch-shifting algorithm for natural pitch augmentation. Pitch-shifted speech samples sound more natural when using the proposed algorithm because the speaker's vocal timbre is maintained. Moreover, we propose a training algorithm that defines FastPitch using pitch-augmented speech datasets with different pitch ranges for the same sentence. The experimental results demonstrate that the proposed algorithms improve the pitch controllability of FastPitch.
Submitted 12 April, 2022;
originally announced April 2022.
-
Quantitative Evaluation of Common Cause Failures in High Safety-significant Safety-related Digital Instrumentation and Control Systems in Nuclear Power Plants
Authors:
Han Bao,
Hongbin Zhang,
Tate Shorthill,
Edward Chen,
Svetlana Lawrence
Abstract:
Digital instrumentation and control (DI&C) systems at nuclear power plants (NPPs) have many advantages over analog systems. They are proven to be more reliable, cheaper, and easier to maintain given the obsolescence of analog components. However, they also pose new engineering and technical challenges, such as the possibility of common cause failures (CCFs) unique to digital systems. This paper proposes a Platform for Risk Assessment of DI&C (PRADIC) developed by Idaho National Laboratory (INL). A methodology for evaluation of software CCFs in high safety-significant safety-related DI&C systems of NPPs was developed as part of the framework. The framework integrates three stages of a typical risk assessment: qualitative hazard analysis and quantitative reliability and consequence analyses. The quantified risks, compared with their respective acceptance criteria, provide valuable insights for system architecture alternatives, allowing design optimization in terms of risk reduction and cost savings. A comprehensive case study performed to demonstrate the framework capabilities is documented in this paper. Results show that PRADIC is a powerful tool capable of identifying potential digital-based CCFs, estimating their probabilities, and evaluating their impacts on system and plant safety.
Submitted 7 April, 2022;
originally announced April 2022.
-
Korean Tokenization for Beam Search Rescoring in Speech Recognition
Authors:
Kyuhong Shim,
Hyewon Bae,
Wonyong Sung
Abstract:
The performance of automatic speech recognition (ASR) models can be greatly improved by proper beam-search decoding with external language model (LM). There has been an increasing interest in Korean speech recognition, but not many studies have been focused on the decoding procedure. In this paper, we propose a Korean tokenization method for neural network-based LM used for Korean ASR. Although the common approach is to use the same tokenization method for external LM as the ASR model, we show that it may not be the best choice for Korean. We propose a new tokenization method that inserts a special token, SkipTC, when there is no trailing consonant in a Korean syllable. By utilizing the proposed SkipTC token, the input sequence for LM becomes very regularly patterned so that the LM can better learn the linguistic characteristics. Our experiments show that the proposed approach achieves a lower word error rate compared to the same LM model without SkipTC. In addition, we are the first to report the ASR performance for the recently introduced large-scale 7,600h Korean speech dataset.
Submitted 28 March, 2022; v1 submitted 22 February, 2022;
originally announced March 2022.
-
An Integrated Risk Assessment Process of Safety-Related Digital I&C Systems in Nuclear Power Plants
Authors:
Hongbin Zhang,
Han Bao,
Tate Shorthill,
Edward Quinn
Abstract:
Upgrading the existing analog instrumentation and control (I&C) systems to state-of-the-art digital I&C (DI&C) systems will greatly benefit existing light-water reactors (LWRs). However, the issue of software common cause failure (CCF) remains an obstacle in terms of qualification for digital technologies. Existing analyses of CCFs in I&C systems mainly focus on hardware failures. With the application and upgrading of new DI&C systems, design flaws could cause software CCFs to become a potential threat to plant safety, considering that most redundancy designs use similar digital platforms or software in their operating and application systems. With complex multi-layer redundancy designs to meet the single failure criterion, these I&C safety systems are of particular concern in U.S. Nuclear Regulatory Commission (NRC) licensing procedures. In Fiscal Year 2019, the Risk-Informed Systems Analysis (RISA) Pathway of the U.S. Department of Energy (DOE) Light Water Reactor Sustainability (LWRS) Program initiated a project to develop a risk assessment strategy for delivering a strong technical basis to support effective, licensable, and secure DI&C technologies for digital upgrades and designs. An integrated risk assessment for the DI&C (IRADIC) process was proposed for this strategy to identify potential key digital-induced failures, implement reliability analyses of related digital safety I&C systems, and evaluate the unanalyzed sequences introduced by these failures (particularly software CCFs) at the plant level. This paper summarizes these RISA efforts in the risk analysis of safety-related DI&C systems at Idaho National Laboratory.
Submitted 16 December, 2021;
originally announced December 2021.
-
ProductAE: Towards Training Larger Channel Codes based on Neural Product Codes
Authors:
Mohammad Vahid Jamali,
Hamid Saber,
Homayoon Hatami,
Jung Hyun Bae
Abstract:
There have been significant research activities in recent years to automate the design of channel encoders and decoders via deep learning. Due to the dimensionality challenge in channel coding, it is prohibitively complex to design and train relatively large neural channel codes via deep learning techniques. Consequently, most of the results in the literature are limited to relatively short codes having fewer than 100 information bits. In this paper, we construct ProductAEs, a computationally efficient family of deep-learning driven (encoder, decoder) pairs, that aim at enabling the training of relatively large channel codes (both encoders and decoders) with a manageable training complexity. We build upon the ideas from classical product codes, and propose constructing large neural codes using smaller code components. More specifically, instead of directly training the encoder and decoder for a large neural code of dimension $k$ and blocklength $n$, we provide a framework that requires training neural encoders and decoders for the code parameters $(n_1,k_1)$ and $(n_2,k_2)$ such that $n_1 n_2=n$ and $k_1 k_2=k$. Our training results show significant gains, over all ranges of signal-to-noise ratio (SNR), for a code of parameters $(225,100)$ and a moderate-length code of parameters $(441,196)$, over polar codes under the successive cancellation (SC) decoder. Moreover, our results demonstrate meaningful gains over Turbo Autoencoder (TurboAE) and state-of-the-art classical codes. This is the first work to design product autoencoders and a pioneering work on training large channel codes.
Submitted 10 September, 2022; v1 submitted 9 October, 2021;
originally announced October 2021.
-
Secrecy Offloading Rate Maximization for Multi-Access Mobile Edge Computing Networks
Authors:
Mingxiong Zhao,
Huiqi Bao,
Li Yin,
Jianping Yao,
Tony Q. S. Quek
Abstract:
This letter considers a multi-access mobile edge computing (MEC) network consisting of multiple users, multiple base stations, and a malicious eavesdropper. Specifically, the users adopt the partial offloading strategy by partitioning the computation task into several parts. One is executed locally and the others are securely offloaded to multiple MEC servers integrated into the base stations by leveraging the physical layer security to combat the eavesdropping. We jointly optimize power allocation, task partition, subcarrier allocation, and computation resource to maximize the secrecy offloading rate of the users, subject to communication and computation resource constraints. Numerical results demonstrate that our proposed scheme can respectively improve the secrecy offloading rate 1.11%--1.39% and 15.05%--17.35% (versus the increase of tasks' latency requirements), and 1.30%--1.75% and 6.08%--9.22% (versus the increase of the maximum transmit power) compared with the two benchmarks. Moreover, it further emphasizes the necessity of conducting computation offloading over multiple MEC servers.
Submitted 21 September, 2021;
originally announced September 2021.
-
N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement
Authors:
Gyeong-Hoon Lee,
Tae-Woo Kim,
Hanbin Bae,
Min-Ji Lee,
Young-Ik Kim,
Hoon-Young Cho
Abstract:
Recently, end-to-end Korean singing voice systems have been designed to generate realistic singing voices. However, these systems still suffer from a lack of robustness in terms of pronunciation accuracy. In this paper, we propose N-Singer, a non-autoregressive Korean singing voice system, to synthesize accurate and pronounced Korean singing voices in parallel. N-Singer consists of a Transformer-based mel-generator, a convolutional network-based postnet, and voicing-aware discriminators. It can contribute in the following ways. First, for accurate pronunciation, N-Singer separately models linguistic and pitch information without other acoustic features. Second, to achieve improved mel-spectrograms, N-Singer uses a combination of Transformer-based modules and convolutional network-based modules. Third, in adversarial training, voicing-aware conditional discriminators are used to capture the harmonic features of voiced segments and noise components of unvoiced segments. The experimental results prove that N-Singer can synthesize a natural singing voice in parallel with a more accurate pronunciation than the baseline model.
Submitted 21 February, 2022; v1 submitted 29 June, 2021;
originally announced June 2021.
-
FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
Authors:
Taejun Bak,
Jae-Sung Bae,
Hanbin Bae,
Young-Ik Kim,
Hoon-Young Cho
Abstract:
Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features. However, synthesized speech with a large pitch-shift scale suffers from audio quality degradation and deformation of speaker characteristics. To address this problem, we propose a feed-forward Transformer-based TTS model that is designed based on the source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. By modeling each feature separately, the model's tendency to learn the relationship between the two features can be mitigated.
Submitted 29 June, 2021;
originally announced June 2021.
-
Joint Time and Power Allocation for 5G NR Unlicensed Systems
Authors:
Haizhou Bao,
Yiming Huo,
Xiaodai Dong,
Chuanhe Huang
Abstract:
The fifth-generation (5G) and beyond networks are designed to efficiently utilize the spectrum resources to meet various quality of service (QoS) requirements. The unlicensed frequency bands used by WiFi are mainly deployed for indoor applications and are not always fully occupied. The cellular industry has been working to enable cellular and WiFi coexistence. In particular, 5G New Radio in unlicensed channel spectrum (NR-U) supports the uplink and downlink transmission on the maximum channel occupation time (MCOT) duration. In this paper, we consider maximizing the total throughput of both downlink and uplink in NR-U by jointly optimizing the time and power allocation during MCOT while ensuring fair coexistence with WiFi. Fairness is guaranteed in two steps: 1) tuning the access related parameters of NR-U to achieve proportional fairness, and 2) including 3GPP fairness from the throughput perspective as a constraint in NR-U throughput maximization. Numerical analysis and simulation have demonstrated the superior performance of the proposed resource allocation algorithm compared to conventional deployment strategies.
Submitted 22 April, 2021;
originally announced April 2021.
-
Estimation of Closest In-Path Vehicle (CIPV) by Low-Channel LiDAR and Camera Sensor Fusion for Autonomous Vehicle
Authors:
Hyunjin Bae,
Gu Lee,
Jaeseung Yang,
Gwanjun Shin,
Yongseob Lim,
Gyeungho Choi
Abstract:
In autonomous driving, using a variety of sensors to recognize preceding vehicles at middle and long distances is helpful for improving driving performance and developing various functions. However, if only LiDAR or only a camera is used in the recognition stage, it is difficult to obtain the necessary data due to the limitations of each sensor. In this paper, we propose a method of converting vision tracking data into bird's eye view (BEV) coordinates using an equation that projects LiDAR points onto an image, and a method of fusing LiDAR and vision tracking data. The results of detecting the closest in-path vehicle (CIPV) in various situations show that the proposed method is effective. In addition, in experiments with the EuroNCAP autonomous emergency braking (AEB) test protocol, AEB performance with the fusion result improves over using only LiDAR, owing to improved perception performance. The performance of the proposed method was verified through actual vehicle tests in various scenarios. Consequently, the results indicate that the proposed sensor fusion method significantly improves the ACC function in autonomous maneuvering. We expect that this improvement in perception performance will contribute to improving the overall stability of ACC.
Submitted 25 March, 2021;
originally announced March 2021.
-
A Neural Text-to-Speech Model Utilizing Broadcast Data Mixed with Background Music
Authors:
Hanbin Bae,
Jae-Sung Bae,
Young-Sun Joo,
Young-Ik Kim,
Hoon-Young Cho
Abstract:
Recently, it has become easier to obtain speech data from various media such as the internet or YouTube, but directly utilizing them to train a neural text-to-speech (TTS) model is difficult. The proportion of clean speech is insufficient and the remainder includes background music. Even with the global style token (GST). Therefore, we propose the following method to successfully train an end-to-end TTS model with limited broadcast data. First, the background music is removed from the speech by introducing a music filter. Second, the GST-TTS model with an auxiliary quality classifier is trained with the filtered speech and a small amount of clean speech. In particular, the quality classifier makes the embedding vector of the GST layer focus on representing the speech quality (filtered or clean) of the input speech. The experimental results verified that the proposed method synthesizes much higher-quality speech than conventional methods.
Submitted 4 March, 2021;
originally announced March 2021.
-
A Design of Cooperative Overtaking Based on Complex Lane Detection and Collision Risk Estimation
Authors:
Junlan Chen,
Ke Wang,
Huanhuan Bao,
Tao Chen
Abstract:
Cooperative overtaking is believed to have the capability of improving road safety and traffic efficiency by means of real-time information exchange between traffic participants, including road infrastructure, nearby vehicles, and others. In this paper, we focus on the critical issues of modeling, computation, and analysis of cooperative overtaking, with the aim of giving it a key role in road overtaking. In detail, to extend awareness of the surrounding environment, the lane markings in front of the ego vehicle were detected using an onboard camera and modeled with Bezier curves, while the nearby vehicle positions were obtained through a vehicle-to-vehicle communication scheme to ensure localization accuracy. Then, a Gaussian-based conflict potential field, which quantitatively estimates the oncoming collision danger, was proposed to guarantee overtaking safety. To support the proposed method, many experiments were conducted on a human-in-the-loop simulation platform. The results demonstrated that our proposed method achieves better performance, especially in some unpredictable natural road circumstances.
Submitted 11 August, 2020;
originally announced August 2020.
-
Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning
Authors:
Jae-Sung Bae,
Hanbin Bae,
Young-Sun Joo,
Junmo Lee,
Gyeong-Hoon Lee,
Hoon-Young Cho
Abstract:
This paper proposes a controllable end-to-end text-to-speech (TTS) system to control the speaking speed (speed-controllable TTS; SCTTS) of synthesized speech with sentence-level speaking-rate value as an additional input. The speaking-rate value, the ratio of the number of input phonemes to the length of input speech, is adopted in the proposed system to control the speaking speed. Furthermore, the proposed SCTTS system can control the speaking speed while retaining other speech attributes, such as the pitch, by adopting the global style token-based style encoder. The proposed SCTTS does not require any additional well-trained model or an external speech database to extract phoneme-level duration information and can be trained in an end-to-end manner. In addition, our listening tests on fast-, normal-, and slow-speed speech showed that the SCTTS can generate more natural speech than other phoneme duration control approaches which increase or decrease duration at the same rate for the entire sentence, especially in the case of slow-speed speech.
Submitted 13 August, 2020; v1 submitted 30 July, 2020;
originally announced July 2020.
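The sentence-level speaking-rate value defined in the abstract above (number of input phonemes divided by the length of the input speech) can be written as a one-line computation. This is a minimal sketch: the function name is hypothetical, and measuring speech length in seconds is an assumption, since the abstract does not state the unit.

```python
def speaking_rate(num_phonemes: int, speech_length_s: float) -> float:
    """Sentence-level speaking rate: phonemes per unit of speech length.

    speech_length_s is assumed to be in seconds; the abstract does not
    specify whether seconds, frames, or samples are used.
    """
    if speech_length_s <= 0:
        raise ValueError("speech length must be positive")
    return num_phonemes / speech_length_s


# The same 30-phoneme sentence spoken over 2 s versus 3 s.
fast = speaking_rate(30, 2.0)
slow = speaking_rate(30, 3.0)
print(fast, slow)
```

A larger value corresponds to faster speech, so the same phoneme sequence paired with a smaller value would ask the system for a slower rendition.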
-
Evaluation of Sampling Methods for Robotic Sediment Sampling Systems
Authors:
Jun Han Bae,
Wonse Jo,
Jee Hwan Park,
Richard M. Voyles,
Sara K. McMillan,
Byung-Cheol Min
Abstract:
Analysis of sediments from rivers, lakes, reservoirs, wetlands and other constructed surface water impoundments is an important tool to characterize the function and health of these systems, but is generally carried out manually. This is costly and can be hazardous and difficult for humans due to inaccessibility, contamination, or availability of required equipment. Robotic sampling systems can ease these burdens, but little work has examined the efficiency of such sampling means and no prior work has investigated the quality of the resulting samples. This paper presents an experimental study that evaluates and optimizes sediment sampling patterns applied to a robot sediment sampling system that allows collection of minimally-disturbed sediment cores from natural and man-made water bodies for various sediment types. To meet this need, we developed and tested a robotic sampling platform in the laboratory to test functionality under a range of sediment types and operating conditions. Specifically, we focused on three patterns by which a cylindrical coring device was driven into the sediment (linear, helical, and zig-zag) for three sediment types (coarse sand, medium sand, and silt). The results show that the optimal sampling pattern varies depending on the type of sediment and can be optimized based on the sampling objective. We examined two sampling objectives: maximizing the mass of minimally disturbed sediment and minimizing the power per mass of sample. This study provides valuable data to aid in the selection of optimal sediment coring methods for various applications and builds a solid foundation for future field testing under a range of environmental conditions.
Submitted 23 June, 2020;
originally announced June 2020.
-
A Redundancy-Guided Approach for the Hazard Analysis of Digital Instrumentation and Control Systems in Advanced Nuclear Power Plants
Authors:
Tate Shorthill,
Han Bao,
Hongbin Zhang,
Heng Ban
Abstract:
Digital instrumentation and control (I&C) upgrades are a vital research area for the nuclear industry. Despite their performance benefits, deployment of digital I&C in nuclear power plants (NPPs) has been limited. Digital I&C systems exhibit complex failure modes, including common cause failures (CCFs), which can be difficult to identify. This paper describes the development of a redundancy-guided application of the Systems-Theoretic Process Analysis (STPA) and Fault Tree Analysis (FTA) for the hazard analysis of digital I&C in advanced NPPs. The resulting Redundancy-guided System-theoretic Hazard Analysis (RESHA) is applied to the case study of a representative state-of-the-art digital reactor trip system. The analysis qualitatively and systematically identifies the most critical CCFs and other hazards of digital I&C systems. Ultimately, RESHA can help researchers make informed decisions about how, and to what degree, defensive measures such as redundancy, diversity, and defense-in-depth can be used to mitigate or eliminate the potential hazards of digital I&C systems.
Submitted 5 May, 2020;
originally announced May 2020.
-
IROS 2019 Lifelong Robotic Vision Challenge -- Lifelong Object Recognition Report
Authors:
Qi She,
Fan Feng,
Qi Liu,
Rosa H. M. Chan,
Xinyue Hao,
Chuanlin Lan,
Qihan Yang,
Vincenzo Lomonaco,
German I. Parisi,
Heechul Bae,
Eoin Brophy,
Baoquan Chen,
Gabriele Graffieti,
Vidit Goel,
Hyonyoung Han,
Sathursan Kanagarajah,
Somesh Kumar,
Siew-Kei Lam,
Tin Lun Lam,
Liang Ma,
Davide Maltoni,
Lorenzo Pellegrini,
Duvindu Piyasena,
Shiliang Pu,
Debdoot Sheet
, et al. (11 additional authors not shown)
Abstract:
This report summarizes the IROS 2019 Lifelong Robotic Vision Competition (Lifelong Object Recognition Challenge) with methods and results from the top $8$ finalists (out of over $150$ teams). The competition dataset (L)ifel(O)ng (R)obotic V(IS)ion (OpenLORIS) - Object Recognition (OpenLORIS-object) is designed for driving lifelong/continual learning research and application in the robotic vision domain, with everyday objects in home, office, campus, and mall scenarios. The dataset explicitly quantifies the variants of illumination, object occlusion, object size, camera-object distance/angles, and clutter information. Rules are designed to quantify the learning capability of the robotic vision system when faced with the objects appearing in the dynamic environments in the contest. Individual reports, dataset information, rules, and released source code can be found at the project homepage: "https://lifelong-robotic-vision.github.io/competition/".
Submitted 26 April, 2020;
originally announced April 2020.
-
Investigations of the Influences of a CNN's Receptive Field on Segmentation of Subnuclei of Bilateral Amygdalae
Authors:
Han Bao
Abstract:
Segmentation of objects with various sizes is relatively less explored in medical imaging, and has been very challenging in computer vision tasks in general. We hypothesize that the receptive field of a deep model corresponds closely to the size of the object to be segmented, which could critically influence the segmentation accuracy of objects with varied sizes. In this study, we employed "AmygNet", a dual-branch fully convolutional neural network (FCNN) with two different sizes of receptive fields, to investigate the effects of receptive field on segmenting four major subnuclei of bilateral amygdalae. The experiment was conducted on 3-dimensional MRI human brain images of 14 subjects. Since the scales of the subnuclear groups differ, investigating the accuracy for each subnuclear group while using receptive fields of various sizes may reveal which receptive field suits objects of which scale. Under the given conditions, AmygNet with multiple receptive fields shows great potential in segmenting objects of different sizes.
Submitted 7 November, 2019;
originally announced November 2019.
-
An efficient coding algorithm for general Framed Pulse Width Modulations
Authors:
Soon-Won Kwon,
Hyeon-Min Bae
Abstract:
This paper introduces a new coding algorithm for Framed Pulse Width Modulation (FPWM). The proposed algorithm requires 93% fewer look-up tables (LUTs) than the previous FPWM coding algorithm and increases the bitrate by 25%. The proposed algorithm is compatible with general FPWM with various frame lengths and pulse width resolutions. Theoretical bitrates and the LUT sizes required for coding various FPWMs are also provided. The MATLAB simulation demonstrates the proposed FPWM signal, which contains 14 bits of information in a frame length of 8 UI, showing a 75% higher bitrate than the NRZ signal with the same baud rate. The decoding algorithm restores the original bits without any bit error and validates the proposed FPWM and its coding scheme.
Submitted 1 May, 2019;
originally announced May 2019.
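The 75% figure in the abstract above follows from simple arithmetic: an NRZ signal carries 1 bit per unit interval (UI), while the simulated FPWM frame carries 14 bits in 8 UI. A minimal sketch of that check, with hypothetical function names:

```python
def bits_per_ui(bits_per_frame: int, frame_length_ui: int) -> float:
    """Information carried per unit interval for a framed modulation."""
    return bits_per_frame / frame_length_ui


def gain_over_nrz(bits_per_frame: int, frame_length_ui: int) -> float:
    """Relative bitrate gain over NRZ at the same baud rate.

    NRZ carries exactly 1 bit per UI, so the gain is bits_per_ui - 1.
    """
    nrz_bits_per_ui = 1.0
    return bits_per_ui(bits_per_frame, frame_length_ui) / nrz_bits_per_ui - 1.0


# 14 bits in an 8 UI frame, as in the simulated configuration.
print(bits_per_ui(14, 8))    # 1.75 bits per UI
print(gain_over_nrz(14, 8))  # 0.75, i.e. 75% higher than NRZ
```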
-
A fully-digital semi-rotational frequency detection algorithm for bang-bang CDRs
Authors:
Soon-Won Kwon,
Hanho Choi,
Younho Jeon,
Bongjin Kim,
WooHyun Kwon,
Homin Park,
Kyeongha Kwon,
Gain Kim,
Hyeon-Min Bae
Abstract:
This work presents a new frequency acquisition method using a semi-rotational frequency detection (SRFD) algorithm for reference-less clock and data recovery (CDR) in a serial-link receiver. The proposed SRFD algorithm classifies the bang-bang phase detector (BBPD) outputs to estimate the current phase state, and detects the frequency mismatch between the input data and the sampling clock. The VCO-track path in a digital loop filter (DLF) enables online calibration of VCO frequency drift caused by temperature or voltage variation after frequency acquisition. The proposed algorithm can be implemented as a digitally-synthesized circuit, lowering design effort for reference-less CDRs. A 10 Gbps transceiver IC with the proposed algorithm, fabricated in a 65 nm CMOS process, demonstrates successful recovery of the input phase without any reference clock.
Submitted 1 May, 2019;
originally announced May 2019.
-
Generating Multi-Scroll Chua's Attractors via Simplified Piecewise-Linear Chua's Diode
Authors:
Ning Wang,
Chengqing Li,
Han Bao,
Mo Chen,
Bocheng Bao
Abstract:
High implementation complexity of multi-scroll circuits is a bottleneck in real chaos-based communication. In particular, in the multi-scroll Chua's circuit, the simplified implementation of piecewise-linear resistors with multiple segments is difficult due to their intricate irregular breakpoints and slopes. To address this challenge, this paper presents a systematic scheme for synthesizing a Chua's diode with multi-segment piecewise-linearity, which is achieved by cascading even-numbered passive nonlinear resistors with odd-numbered ones via a negative impedance converter. Traditional voltage-mode op-amps are used to implement the nonlinear resistors. As no extra DC bias voltage is employed, the scheme can be implemented by much simpler circuits. The voltage-current characteristics of the obtained Chua's diode are analyzed theoretically and verified by numerical simulations. Using the Chua's diode and a second-order active Sallen-Key high-pass filter, a new inductor-free Chua's circuit is then constructed to generate multi-scroll chaotic attractors. Different numbers of scrolls can be generated by changing the number of passive nonlinear resistor cells or adjusting two coupling parameters. In addition, the system can be scaled by using different power supplies, satisfying the low-voltage low-power requirement of integrated circuit design. The circuit simulations and hardware experiments both confirmed the feasibility of the designed system.
Submitted 21 August, 2019; v1 submitted 27 October, 2018;
originally announced October 2018.