-
Performance analysis of metasurface-based spatial multimode transmission for 6G wireless communications
Authors:
Ju Yong Lee,
Seung-Won Keum,
Sang Min Oh,
Dang-Oh Kim,
Dong-Ho Cho
Abstract:
In 6th generation wireless communication technology, it is important to utilize space resources efficiently. Recently, holographic multiple-input multiple-output (HMIMO) and meta-surface technology have attracted attention as technologies that maximize space utilization for 6G mobile communications. However, studies on HMIMO communications are still in an initial stage and its fundamental limits a…
▽ More
In 6th generation wireless communication technology, it is important to utilize space resources efficiently. Recently, holographic multiple-input multiple-output (HMIMO) and meta-surface technology have attracted attention as technologies that maximize space utilization for 6G mobile communications. However, studies on HMIMO communications are still in an initial stage and its fundamental limits are yet to be unveiled. It is well known that the Fourier transform relationship can be obtained using a lens in the optical field, but research to apply it to the mobile communication field is in the early stages. In this paper, we show that the Fourier transform relationship between signals can be obtained when two metasurfaces are aligned or unaligned, and analyze the transmission and reception power, and the maximum number of spatial multimodes that can be transmitted. In addition, to reduce transmission complexity, we propose a spatial multimode transmission system using three metasurfaces and analyze signal characteristics on the meta-surfaces. In numerical results, we provide the performance of spatial multimode transmission in case of using rectangular and Gaussian signals.
△ Less
Submitted 31 March, 2025;
originally announced April 2025.
-
360-Degree Video Super Resolution and Quality Enhancement Challenge: Methods and Results
Authors:
Ahmed Telili,
Wassim Hamidouche,
Ibrahim Farhat,
Hadi Amirpour,
Christian Timmerer,
Ibrahim Khadraoui,
Jiajie Lu,
The Van Le,
Jeonneung Baek,
Jin Young Lee,
Yiying Wei,
Xiaopeng Sun,
Yu Gao,
JianCheng Huangl,
Yujie Zhong
Abstract:
Omnidirectional (360-degree) video is rapidly gaining popularity due to advancements in immersive technologies like virtual reality (VR) and extended reality (XR). However, real-time streaming of such videos, especially in live mobile scenarios like unmanned aerial vehicles (UAVs), is challenged by limited bandwidth and strict latency constraints. Traditional methods, such as compression and adapt…
▽ More
Omnidirectional (360-degree) video is rapidly gaining popularity due to advancements in immersive technologies like virtual reality (VR) and extended reality (XR). However, real-time streaming of such videos, especially in live mobile scenarios like unmanned aerial vehicles (UAVs), is challenged by limited bandwidth and strict latency constraints. Traditional methods, such as compression and adaptive resolution, help but often compromise video quality and introduce artifacts that degrade the viewer experience. Additionally, the unique spherical geometry of 360-degree video presents challenges not encountered in traditional 2D video. To address these issues, we initiated the 360-degree Video Super Resolution and Quality Enhancement Challenge. This competition encourages participants to develop efficient machine learning solutions to enhance the quality of low-bitrate compressed 360-degree videos, with two tracks focusing on 2x and 4x super-resolution (SR). In this paper, we outline the challenge framework, detailing the two competition tracks and highlighting the SR solutions proposed by the top-performing models. We assess these models within a unified framework, considering quality enhancement, bitrate gain, and computational efficiency. This challenge aims to drive innovation in real-time 360-degree video streaming, improving the quality and accessibility of immersive visual experiences.
△ Less
Submitted 11 November, 2024;
originally announced November 2024.
-
SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech
Authors:
Minchan Kim,
Myeonghun Jeong,
Joun Yeop Lee,
Nam Soo Kim
Abstract:
We present SegINR, a novel approach to neural Text-to-Speech (TTS) that addresses sequence alignment without relying on an auxiliary duration predictor and complex autoregressive (AR) or non-autoregressive (NAR) frame-level sequence modeling. SegINR simplifies the process by converting text sequences directly into frame-level features. It leverages an optimal text encoder to extract embeddings, tr…
▽ More
We present SegINR, a novel approach to neural Text-to-Speech (TTS) that addresses sequence alignment without relying on an auxiliary duration predictor and complex autoregressive (AR) or non-autoregressive (NAR) frame-level sequence modeling. SegINR simplifies the process by converting text sequences directly into frame-level features. It leverages an optimal text encoder to extract embeddings, transforming each into a segment of frame-level features using a conditional implicit neural representation (INR). This method, named segment-wise INR (SegINR), models temporal dynamics within each segment and autonomously defines segment boundaries, reducing computational costs. We integrate SegINR into a two-stage TTS framework, using it for semantic token prediction. Our experiments in zero-shot adaptive TTS scenarios demonstrate that SegINR outperforms conventional methods in speech quality with computational efficiency.
△ Less
Submitted 6 October, 2024;
originally announced October 2024.
-
High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model
Authors:
Joun Yeop Lee,
Myeonghun Jeong,
Minchan Kim,
Ji-Hyun Lee,
Hoon-Young Cho,
Nam Soo Kim
Abstract:
We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target v…
▽ More
We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target voice to generate acoustic tokens from semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. In contrast, the Speaking stage utilizes a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that this innovative structure surpasses the conventional models in the zero-shot scenario in terms of speech quality and speaker similarity.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction
Authors:
Minchan Kim,
Myeonghun Jeong,
Byoung Jin Choi,
Semin Kim,
Joun Yeop Lee,
Nam Soo Kim
Abstract:
We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For a robust and efficient alignment modeling, we employ a neural transducer named token trans…
▽ More
We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For a robust and efficient alignment modeling, we employ a neural transducer named token transducer for the semantic token prediction, benefiting from its hard monotonic alignment constraints. Subsequently, a non-autoregressive (NAR) speech generator efficiently synthesizes waveforms from these semantic tokens. Additionally, a reference speech controls temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing each stage to focus on semantic and acoustic modeling. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity, both objectively and subjectively. We also delve into the inference speed and prosody control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.
△ Less
Submitted 2 January, 2024;
originally announced January 2024.
-
Efficient Parallel Audio Generation using Group Masked Language Modeling
Authors:
Myeonghun Jeong,
Minchan Kim,
Joun Yeop Lee,
Nam Soo Kim
Abstract:
We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference speed compared to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling~(G-MLM) and Group Iterative Parallel Decoding~(G-…
▽ More
We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference speed compared to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling~(G-MLM) and Group Iterative Parallel Decoding~(G-IPD) for efficient parallel audio generation. Both the training and sampling schemes enable the model to synthesize high-quality audio with a small number of iterations by effectively modeling the group-wise conditional dependencies. In addition, our model employs a cross-attention-based architecture to capture the speaker style of the prompt voice and improves computational efficiency. Experimental results demonstrate that our proposed model outperforms the baselines in prompt-based audio generation.
△ Less
Submitted 2 January, 2024;
originally announced January 2024.
-
Label Space Partition Selection for Multi-Object Tracking Using Two-Layer Partitioning
Authors:
Ji Youn Lee,
Changbeom Shim,
Hoa Van Nguyen,
Tran Thien Dat Nguyen,
Hyunjin Choi,
Youngho Kim
Abstract:
Estimating the trajectories of multi-objects poses a significant challenge due to data association ambiguity, which leads to a substantial increase in computational requirements. To address such problems, a divide-and-conquer manner has been employed with parallel computation. In this strategy, distinguished objects that have unique labels are grouped based on their statistical dependencies, the i…
▽ More
Estimating the trajectories of multi-objects poses a significant challenge due to data association ambiguity, which leads to a substantial increase in computational requirements. To address such problems, a divide-and-conquer manner has been employed with parallel computation. In this strategy, distinguished objects that have unique labels are grouped based on their statistical dependencies, the intersection of predicted measurements. Several geometry approaches have been used for label grouping since finding all intersected label pairs is clearly infeasible for large-scale tracking problems. This paper proposes an efficient implementation of label grouping for label-partitioned generalized labeled multi-Bernoulli filter framework using a secondary partitioning technique. This allows for parallel computation in the label graph indexing step, avoiding generating and eliminating duplicate comparisons. Additionally, we compare the performance of the proposed technique with several efficient spatial searching algorithms. The results demonstrate the superior performance of the proposed approach on large-scale data sets, enabling scalable trajectory estimation.
△ Less
Submitted 22 October, 2023;
originally announced October 2023.
-
Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis
Authors:
Jae-Sung Bae,
Joun Yeop Lee,
Ji-Hyun Lee,
Seongkyu Mun,
Taehwa Kang,
Hoon-Young Cho,
Chanwoo Kim
Abstract:
Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to enhance its systems by enlarging the training data through crowd-sourcing or augmenting existing speech data. However, the use of low-quality data has led to a decline in the overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose a latent filling (LF) method that adopts s…
▽ More
Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to enhance its systems by enlarging the training data through crowd-sourcing or augmenting existing speech data. However, the use of low-quality data has led to a decline in the overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose a latent filling (LF) method that adopts simple but effective latent space data augmentation in the speaker embedding space of the ZS-TTS system. By incorporating a consistency loss, LF can be seamlessly integrated into existing ZS-TTS systems without the need for additional training stages. Experimental results show that LF significantly improves speaker similarity while preserving speech quality.
△ Less
Submitted 22 January, 2024; v1 submitted 5 October, 2023;
originally announced October 2023.
-
Lightweight Monocular Depth Estimation via Token-Sharing Transformer
Authors:
Dong-Jae Lee,
Jae Young Lee,
Hyounguk Shon,
Eojindl Yi,
Yeong-Hun Park,
Sung-Sik Cho,
Junmo Kim
Abstract:
Depth estimation is an important task in various robotics systems and applications. In mobile robotics systems, monocular depth estimation is desirable since a single RGB camera can be deployable at a low cost and compact size. Due to its significant and growing needs, many lightweight monocular depth estimation networks have been proposed for mobile robotics systems. While most lightweight monocu…
▽ More
Depth estimation is an important task in various robotics systems and applications. In mobile robotics systems, monocular depth estimation is desirable since a single RGB camera can be deployable at a low cost and compact size. Due to its significant and growing needs, many lightweight monocular depth estimation networks have been proposed for mobile robotics systems. While most lightweight monocular depth estimation methods have been developed using convolution neural networks, the Transformer has been gradually utilized in monocular depth estimation recently. However, massive parameters and large computational costs in the Transformer disturb the deployment to embedded devices. In this paper, we present a Token-Sharing Transformer (TST), an architecture using the Transformer for monocular depth estimation, optimized especially in embedded devices. The proposed TST utilizes global token sharing, which enables the model to obtain an accurate depth prediction with high throughput in embedded devices. Experimental results show that TST outperforms the existing lightweight monocular depth estimation methods. On the NYU Depth v2 dataset, TST can deliver depth maps up to 63.4 FPS in NVIDIA Jetson nano and 142.6 FPS in NVIDIA Jetson TX2, with lower errors than the existing methods. Furthermore, TST achieves real-time depth estimation of high-resolution images on Jetson TX2 with competitive results.
△ Less
Submitted 9 June, 2023;
originally announced June 2023.
-
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech
Authors:
Byoung Jin Choi,
Myeonghun Jeong,
Joun Yeop Lee,
Nam Soo Kim
Abstract:
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristic of an unseen speaker. The main challenge of ZSM-TTS is to increase the overall speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models is to utilize the functions which predict the scal…
▽ More
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristic of an unseen speaker. The main challenge of ZSM-TTS is to increase the overall speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models is to utilize the functions which predict the scale and bias parameters of the affine coupling layers according to the given speaker embedding vector. In this letter, we improve on the previous speaker conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer which allows for unseen speaker speech synthesis in a zero-shot manner leveraging a normalization-based conditioning technique. The newly designed coupling layer explicitly normalizes the input by the parameters predicted from a speaker embedding vector while training, enabling an inverse process of denormalizing for a new speaker embedding at inference. The proposed conditioning scheme yields the state-of-the-art performance in terms of the speech quality and speaker similarity in a ZSM-TTS setting.
△ Less
Submitted 30 November, 2022;
originally announced November 2022.
-
An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space
Authors:
Jihwan Lee,
Jae-Sung Bae,
Seongkyu Mun,
Heejin Choi,
Joun Yeop Lee,
Hoon-Young Cho,
Chanwoo Kim
Abstract:
With the recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise. Moreover, running a subjective evaluation for such cross-lingual TTS systems is troublesome. The vowel space analysis, which is often utilized to explore various aspects of language including L2 accents, is a great alternative analysis tool. In this study, we apply th…
▽ More
With the recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise. Moreover, running a subjective evaluation for such cross-lingual TTS systems is troublesome. The vowel space analysis, which is often utilized to explore various aspects of language including L2 accents, is a great alternative analysis tool. In this study, we apply the vowel space analysis method to explore L2 accents of cross-lingual TTS systems. Through the vowel space analysis, we observe the three followings: a) a parallel architecture (Glow-TTS) is less L2-accented than an auto-regressive one (Tacotron); b) L2 accents are more dominant in non-shared vowels in a language pair; and c) L2 accents of cross-lingual TTS systems share some phenomena with those of human L2 learners. Our findings imply that it is necessary for TTS systems to handle each language pair differently, depending on their linguistic characteristics such as non-shared vowels. They also hint that we can further incorporate linguistics knowledge in developing cross-lingual TTS systems.
△ Less
Submitted 6 November, 2022;
originally announced November 2022.
-
Ptychographic lens-less polarization microscopy
Authors:
Jeongsoo Kim,
Seungri Song,
Bora Kim,
Mirae Park,
Seung Jae Oh,
Daesuk Kim,
Barry Cense,
Yong-Min Huh,
Joo Yong Lee,
Chulmin Joo
Abstract:
Birefringence, an inherent characteristic of optically anisotropic materials, is widely utilized in various imaging applications ranging from material characterizations to clinical diagnosis. Polarized light microscopy enables high-resolution, high-contrast imaging of optically anisotropic specimens, but it is associated with mechanical rotations of polarizer/analyzer and relatively complex optica…
▽ More
Birefringence, an inherent characteristic of optically anisotropic materials, is widely utilized in various imaging applications ranging from material characterizations to clinical diagnosis. Polarized light microscopy enables high-resolution, high-contrast imaging of optically anisotropic specimens, but it is associated with mechanical rotations of polarizer/analyzer and relatively complex optical designs. Here, we present a novel form of polarization-sensitive microscopy capable of birefringence imaging of transparent objects without an optical lens and any moving parts. Our method exploits an optical mask-modulated polarization image sensor and single-input-state LED illumination design to obtain complex and birefringence images of the object via ptychographic phase retrieval. Using a camera with a pixel resolution of 3.45 um, the method achieves birefringence imaging with a half-pitch resolution of 2.46 um over a 59.74 mm^2 field-of-view, which corresponds to a space-bandwidth product of 9.9 megapixels. We demonstrate the high-resolution, large-area birefringence imaging capability of our method by presenting the birefringence images of various anisotropic objects, including a birefringent resolution target, liquid crystal polymer depolarizer, monosodium urate crystal, and excised mouse eye and heart tissues.
△ Less
Submitted 13 September, 2022;
originally announced September 2022.
-
Graph Fourier transforms on directed product graphs
Authors:
Cheng Cheng,
Yang Chen,
Jeon Yu Lee,
Qiyu Sun
Abstract:
Graph Fourier transform (GFT) is one of the fundamental tools in graph signal processing to decompose graph signals into different frequency components and to represent graph signals with strong correlation by different modes of variation effectively. The GFT on undirected graphs has been well studied and several approaches have been proposed to define GFTs on directed graphs. In this paper, based…
▽ More
Graph Fourier transform (GFT) is one of the fundamental tools in graph signal processing to decompose graph signals into different frequency components and to represent graph signals with strong correlation by different modes of variation effectively. The GFT on undirected graphs has been well studied and several approaches have been proposed to define GFTs on directed graphs. In this paper, based on the singular value decompositions of some graph Laplacians, we propose two GFTs on the Cartesian product graph of two directed graphs. We show that the proposed GFTs could represent spatial-temporal data sets on directed networks with strong correlation efficiently, and in the undirected graph setting they are essentially the joint GFT in the literature. In this paper, we also consider the bandlimiting procedure in the spectral domain of the proposed GFTs, and demonstrate its performance to denoise the temperature data set in the region of Brest (France) on January 2014.
△ Less
Submitted 7 September, 2022; v1 submitted 3 September, 2022;
originally announced September 2022.
-
Into-TTS : Intonation Template Based Prosody Control System
Authors:
Jihwan Lee,
Joun Yeop Lee,
Heejin Choi,
Seongkyu Mun,
Sangjun Park,
Jae-Sung Bae,
Chanwoo Kim
Abstract:
Intonations play an important role in delivering the intention of a speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervi…
▽ More
Intonations play an important role in delivering the intention of a speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervised manner. Two proposed modules are added to the end-to-end TTS framework: an intonation predictor and an intonation encoder. The intonation predictor recommends a suitable intonation template to the given text. The intonation encoder, attached to the text encoder output, synthesizes speech abiding the requested intonation template. Main contributions of our paper are: (a) an easy-to-use intonation control system covering a wide range of users; (b) better performance in wrapping speech in a requested intonation with improved objective and subjective evaluation; and (c) incorporating a pre-trained language model for intonation modelling. Audio samples are available at https://srtts.github.io/IntoTTS.
△ Less
Submitted 6 November, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
Authors:
Minchan Kim,
Myeonghun Jeong,
Byoung Jin Choi,
Sunghwan Ahn,
Joun Yeop Lee,
Nam Soo Kim
Abstract:
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training. By leveraging wav2vec2.0 representation, unlabeled speech can highly improve performance, especially in the lack of labeled speech. We also…
▽ More
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training. By leveraging wav2vec2.0 representation, unlabeled speech can highly improve performance, especially in the lack of labeled speech. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that the single speaker TTS model fine-tuned on the only 10 minutes of labeled dataset outperforms the other baselines, and the ZS-TTS model fine-tuned on the only 30 minutes of single speaker dataset can generate the voice of the arbitrary speaker, by pre-training on unlabeled multi-speaker speech corpus.
△ Less
Submitted 6 October, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.