-
Enhancing Satellite Quantum Key Distribution with Dual Band Reconfigurable Intelligent Surfaces
Authors:
Muhammad Khalil,
Ke Wang,
Jinho Choi
Abstract:
This paper presents a novel system architecture for hybrid satellite communications, integrating quantum key distribution (QKD) and classical radio frequency (RF) data transmission using a dual-band reconfigurable intelligent surface (RIS). The motivation is to address the growing need for global, secure, and reliable communications by leveraging the security of quantum optical links and the robustness of classical RF channels within a unified framework. By employing a frequency-selective RIS, the system independently optimizes both quantum (850 nm) and classical (S-band) channels in real time, dynamically adapting to environmental fluctuations such as atmospheric turbulence and rain attenuation. The joint optimization of the quantum bit error rate (QBER) and the classical signal-to-noise ratio (SNR) is formulated as a quadratic unconstrained binary optimization (QUBO) problem, enabling efficient adaptive phase control utilizing both quantum and classical computational methods. Comprehensive theoretical modeling and simulations, benchmarked against experimental data from the Micius satellite, demonstrate substantial performance gains. Notably, the RIS-assisted system reduces QBER from approximately 2.5% to 0.7%, increases the secure key rate (SKR) to over 30,000 bits per second, and enhances classical RF SNR by about 3 dB at high elevation angles. These results illustrate the practical potential of hybrid RIS-assisted satellite links to deliver robust, efficient, and secure global communications.
Submitted 3 July, 2025;
originally announced July 2025.
-
Low-Complexity Semantic Packet Aggregation for Token Communication via Lookahead Search
Authors:
Seunghun Lee,
Jihong Park,
Jinho Choi,
Hyuncheol Park
Abstract:
Tokens are fundamental processing units of generative AI (GenAI) and large language models (LLMs), and token communication (TC) is essential for enabling remote AI-generated content (AIGC) and wireless LLM applications. Unlike traditional bits, each of which is independently treated, the semantics of each token depends on its surrounding context tokens. This inter-token dependency makes TC vulnerable to outage channels, where the loss of a single token can significantly distort the original message semantics. Motivated by this, this paper focuses on optimizing token packetization to maximize the average token similarity (ATS) between the original and received token messages under outage channels. Due to inter-token dependency, this token grouping problem is combinatorial, with complexity growing exponentially with message length. To address this, we propose a novel framework of semantic packet aggregation with lookahead search (SemPA-Look), built on two core ideas. First, it introduces the residual semantic score (RSS) as a token-level surrogate for the message-level ATS, allowing robust semantic preservation even when a certain token packet is lost. Second, instead of full search, SemPA-Look applies a lookahead search-inspired algorithm that samples intra-packet token candidates without replacement (fixed depth), conditioned on inter-packet token candidates sampled with replacement (fixed width), thereby achieving linear complexity. Experiments on a remote AIGC task with the MS-COCO dataset (text captioned images) demonstrate that SemPA-Look achieves high ATS and LPIPS scores comparable to exhaustive search, while reducing computational complexity by up to 40×. Compared to other linear-complexity algorithms such as the genetic algorithm (GA), SemPA-Look achieves 10× lower complexity, demonstrating its practicality for remote AIGC and other TC applications.
Submitted 24 June, 2025;
originally announced June 2025.
-
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Authors:
Minsoo Kim,
Kyuhong Shim,
Jungwook Choi,
Simyung Chang
Abstract:
Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
Submitted 17 June, 2025;
originally announced June 2025.
-
ASAP-FE: Energy-Efficient Feature Extraction Enabling Multi-Channel Keyword Spotting on Edge Processors
Authors:
Jongin Choi,
Jina Park,
Woojoo Lee,
Jae-Jin Lee,
Massoud Pedram
Abstract:
Multi-channel keyword spotting (KWS) has become crucial for voice-based applications in edge environments. However, its substantial computational and energy requirements pose significant challenges. We introduce ASAP-FE (Agile Sparsity-Aware Parallelized-Feature Extractor), a hardware-oriented front-end designed to address these challenges. Our framework incorporates three key innovations: (1) Half-overlapped Infinite Impulse Response (IIR) Framing: This reduces redundant data by approximately 25% while maintaining essential phoneme transition cues. (2) Sparsity-aware Data Reduction: We exploit frame-level sparsity to achieve an additional 50% data reduction by combining frame skipping with stride-based filtering. (3) Dynamic Parallel Processing: We introduce a parameterizable filter cluster and a priority-based scheduling algorithm that allows parallel execution of IIR filtering tasks, reducing latency and optimizing energy efficiency. ASAP-FE is implemented with various filter cluster sizes on edge processors, with functionality verified on FPGA prototypes and designs synthesized at 45 nm. Experimental results using TC-ResNet8, DS-CNN, and KWT-1 demonstrate that ASAP-FE reduces the average workload by 62.73% while supporting real-time processing for up to 32 channels. Compared to a conventional fully overlapped baseline, ASAP-FE achieves less than a 1% accuracy drop (e.g., 96.22% vs. 97.13% for DS-CNN), which is well within acceptable limits for edge AI. By adjusting the number of filter modules, our design optimizes the trade-off between performance and energy, with 15 parallel filters providing optimal performance for up to 25 channels. Overall, ASAP-FE offers a practical and efficient solution for multi-channel KWS on energy-constrained edge devices.
Submitted 17 June, 2025;
originally announced June 2025.
-
Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models
Authors:
Kyowoon Lee,
Artyom Stitsyuk,
Gunu Jho,
Inchul Hwang,
Jaesik Choi
Abstract:
Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
Submitted 1 June, 2025;
originally announced June 2025.
-
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
Authors:
Jeongsoo Choi,
Jaehun Kim,
Joon Son Chung
Abstract:
This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech's duration and speaking pace, while achieving competitive translation performance.
Submitted 27 May, 2025;
originally announced May 2025.
-
VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion
Authors:
Joon-Seung Choi,
Dong-Min Byun,
Hyung-Seok Oh,
Seong-Whan Lee
Abstract:
Controlling singing style is crucial for achieving an expressive and natural singing voice. Among the various style factors, vibrato plays a key role in conveying emotions and enhancing musical depth. However, modeling vibrato remains challenging due to its dynamic nature, making it difficult to control in singing voice conversion. To address this, we propose VibE-SVC, a controllable singing voice conversion model that explicitly extracts and manipulates vibrato using discrete wavelet transform. Unlike previous methods that model vibrato implicitly, our approach decomposes the F0 contour into frequency components, enabling precise transfer. This allows vibrato control for enhanced flexibility. Experimental results show that VibE-SVC effectively transforms singing styles while preserving speaker similarity. Both subjective and objective evaluations confirm high-quality conversion.
Submitted 27 May, 2025;
originally announced May 2025.
-
Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
Authors:
Jeongsoo Choi,
Zhikang Niu,
Ji-Hoon Kim,
Chunhui Wang,
Joon Son Chung,
Xie Chen
Abstract:
The goal of this paper is to optimize the training process of diffusion-based text-to-speech models. While recent studies have achieved remarkable advancements, their training demands substantial time and computational costs, largely due to the implicit guidance of diffusion models in learning complex intermediate representations. To address this, we propose A-DMA, an effective strategy for Accelerating training with Dual Modality Alignment. Our method introduces a novel alignment pipeline leveraging both text and speech modalities: text-guided alignment, which incorporates contextual representations, and speech-guided alignment, which refines semantic representations. By aligning hidden states with discriminative features, our training scheme reduces the reliance on diffusion models for learning complex representations. Extensive experiments demonstrate that A-DMA doubles the convergence speed while achieving superior performance over baselines. Code and demo samples are available at: https://github.com/ZhikangNiu/A-DMA
Submitted 30 May, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission
Authors:
Seungeun Oh,
Jinhyuk Kim,
Jihong Park,
Seung-Woo Ko,
Jinho Choi,
Tony Q. S. Quek,
Seong-Lyun Kim
Abstract:
To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between the SLM's uncertainty and the LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206× higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
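The opportunistic, compressed upload described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of Shannon entropy as the uncertainty measure, the top-k truncation rule, the function name `maybe_upload`, and the default `threshold` and `k` values are all assumptions made for illustration only.

```python
import math

def maybe_upload(probs, threshold=1.0, k=2):
    """Decide whether a draft token's distribution should be uploaded.

    Hypothetical sketch: uncertainty is measured by Shannon entropy (an
    assumption), and truncation keeps the k most likely vocabulary entries.
    Returns None when the upload is skipped, otherwise a dict mapping
    vocabulary index -> renormalized probability.
    """
    # Entropy in nats; zero-probability entries contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)

    # Low uncertainty: skip the upload entirely.
    if entropy <= threshold:
        return None

    # High uncertainty: keep only the top-k entries and renormalize.
    top = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {i: p / total for i, p in top}


confident = [0.97, 0.01, 0.01, 0.01]   # low entropy, upload skipped
uncertain = [0.4, 0.3, 0.2, 0.1]       # higher entropy, truncated upload
print(maybe_upload(confident))
print(maybe_upload(uncertain))
```

In this toy setting the confident draft is never transmitted, while the uncertain one is sent as a two-entry distribution instead of the full vocabulary.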
Submitted 16 May, 2025;
originally announced May 2025.
-
Space-Time Beamforming for LEO Satellite Communications
Authors:
Jungbin Yim,
Jinseok Choi,
Jeonghun Park,
Ian P. Roberts,
Namyoon Lee
Abstract:
Inter-beam interference poses a significant challenge in low Earth orbit (LEO) satellite communications due to dense satellite constellations. To address this issue, we introduce space-time beamforming, a novel paradigm that leverages the space-time channel vector, uniquely determined by the angle of arrival (AoA) and relative Doppler shift, to optimize beamforming between a moving satellite transmitter and a ground station user. We propose two space-time beamforming techniques: space-time zero-forcing (ST-ZF) and space-time signal-to-leakage-plus-noise ratio (ST-SLNR) maximization. In a partially connected interference channel, ST-ZF achieves a 3 dB SNR gain over the conventional interference avoidance method using maximum ratio transmission beamforming. Moreover, in general interference networks, ST-SLNR beamforming significantly enhances sum spectral efficiency compared to conventional interference management approaches. These results demonstrate the effectiveness of space-time beamforming in improving spectral efficiency and interference mitigation for next-generation LEO satellite networks.
Submitted 12 May, 2025;
originally announced May 2025.
-
Resolving Conflicting Constraints in Multi-Agent Reinforcement Learning with Layered Safety
Authors:
Jason J. Choi,
Jasmine Jerry Aloor,
Jingqi Li,
Maria G. Mendoza,
Hamsa Balakrishnan,
Claire J. Tomlin
Abstract:
Preventing collisions in multi-robot navigation is crucial for deployment. This requirement hinders the use of learning-based approaches, such as multi-agent reinforcement learning (MARL), on their own due to their lack of safety guarantees. Traditional control methods, such as reachability and control barrier functions, can provide rigorous safety guarantees when interactions are limited only to a small number of robots. However, conflicts between the constraints faced by different agents pose a challenge to safe multi-agent coordination.
To overcome this challenge, we propose a method that integrates multiple layers of safety by combining MARL with safety filters. First, MARL is used to learn strategies that minimize multiple agent interactions, where multiple indicates more than two. Particularly, we focus on interactions likely to result in conflicting constraints within the engagement distance. Next, for agents that enter the engagement distance, we prioritize pairs requiring the most urgent corrective actions. Finally, a dedicated safety filter provides tactical corrective actions to resolve these conflicts. Crucially, the design decisions for all layers of this framework are grounded in reachability analysis and a control barrier-value function-based filtering mechanism.
We validate our Layered Safe MARL framework in 1) hardware experiments using Crazyflie drones and 2) high-density advanced aerial mobility (AAM) operation scenarios, where agents navigate to designated waypoints while avoiding collisions. The results show that our method significantly reduces conflict while maintaining safety without sacrificing much efficiency (i.e., shorter travel time and distance) compared to baselines that do not incorporate layered safety. The project website is available at https://dinamo-mit.github.io/Layered-Safe-MARL/
Submitted 4 May, 2025;
originally announced May 2025.
-
Semantic Packet Aggregation for Token Communication via Genetic Beam Search
Authors:
Seunghun Lee,
Jihong Park,
Jinho Choi,
Hyuncheol Park
Abstract:
Token communication (TC) is poised to play a pivotal role in emerging language-driven applications such as AI-generated content (AIGC) and wireless large language models (LLMs). However, token loss caused by channel noise can severely degrade task performance. To address this, in this article, we focus on the problem of semantics-aware packetization and develop a novel algorithm, termed semantic packet aggregation with genetic beam search (SemPA-GBeam), which aims to maximize the average token similarity (ATS) over erasure channels. Inspired by the genetic algorithm (GA) and the beam search algorithm, SemPA-GBeam iteratively optimizes token grouping for packetization within a fixed number of groups (i.e., fixed beam width in beam search) while randomly swapping a fraction of tokens (i.e., mutation in GA). Experiments on the MS-COCO dataset demonstrate that SemPA-GBeam achieves ATS and LPIPS scores comparable to exhaustive search while reducing complexity by more than 20×.
Submitted 28 April, 2025;
originally announced April 2025.
-
Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators
Authors:
Joohwan Seo,
Nikhil Potu Surya Prakash,
Soomi Lee,
Arvind Kruthiventy,
Megan Teng,
Jongeun Choi,
Roberto Horowitz
Abstract:
In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using a differential geometric perspective. As in the case of the UFIC, the GUFIC utilizes energy tank augmentation for both force-tracking and impedance control to guarantee the manipulator's passivity relative to external forces. This ensures that the end effector maintains safe contact interaction with uncertain environments and tracks a desired interaction force. Moreover, we resolve a non-causal implementation problem in the UFIC formulation by introducing velocity and force fields. Due to its formulation on SE(3), the proposed GUFIC inherits the desirable SE(3) invariance and equivariance properties of the GIC, which helps increase sample efficiency in machine learning applications where a learning algorithm is incorporated into the control law. The proposed control law is validated in a simulation environment under scenarios requiring tracking an SE(3) trajectory, incorporating both position and orientation, while exerting a force on a surface. The codes are available at https://github.com/Joohwan-Seo/GUFIC_mujoco.
Submitted 23 April, 2025;
originally announced April 2025.
-
CST-former: Multidimensional Attention-based Transformer for Sound Event Localization and Detection in Real Scenes
Authors:
Yusun Shul,
Dayun Choi,
Jung-Woo Choi
Abstract:
Sound event localization and detection (SELD) is a task for the classification of sound events and the identification of direction of arrival (DoA) utilizing multichannel acoustic signals. For effective classification and localization, a channel-spectro-temporal transformer (CST-former) was suggested. CST-former employs multidimensional attention mechanisms across the spatial, spectral, and temporal domains to enlarge the model's capacity to learn the domain information essential for event detection and DoA estimation over time. In this work, we present an enhanced version of CST-former with multiscale unfolded local embedding (MSULE) developed to capture and aggregate domain information over multiple time-frequency scales. We also propose fine-tuning and post-processing techniques beneficial for conducting the SELD task over limited training datasets. In-depth ablation studies of the proposed architecture and detailed analysis of the proposed modules are carried out to validate the efficacy of multidimensional attention on the SELD task. Empirical validation through experimentation on the STARSS22 and STARSS23 datasets demonstrates the remarkable performance of CST-former and the post-processing techniques without using external data.
Submitted 17 April, 2025;
originally announced April 2025.
-
Trainable Adaptive Score Normalization for Automatic Speaker Verification
Authors:
Jeong-Hwan Choi,
Ju-Seok Seong,
Ye-Rin Jeoung,
Joon-Hyuk Chang
Abstract:
Adaptive S-norm (AS-norm) calibrates automatic speaker verification (ASV) scores by normalizing them using the scores of impostors that are similar to the input speaker. However, AS-norm does not involve any learning process, limiting its ability to provide appropriate regularization strength for various evaluation utterances. To address this limitation, we propose a trainable AS-norm (TAS-norm) that leverages learnable impostor embeddings (LIEs), which are used to compose the cohort. These LIEs are initialized to represent each speaker in a training dataset consisting of impostor speakers. Subsequently, LIEs are fine-tuned by simulating an ASV evaluation. We utilize a margin penalty during the selection of top-scoring IEs in fine-tuning to prevent non-impostor speakers from being selected. In our experiments with ECAPA-TDNN, the proposed TAS-norm achieved 4.11% and 10.62% relative improvements in equal error rate and minimum detection cost function, respectively, on the VoxCeleb1-O trial compared with standard AS-norm without the proposed LIEs. We further validated the effectiveness of the TAS-norm on additional ASV datasets comprising Persian and Chinese speech, demonstrating its robustness across different languages.
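For readers unfamiliar with the non-trainable baseline, the following is a minimal sketch of standard AS-norm as it is commonly formulated, not of the proposed TAS-norm. The function name `as_norm`, the `top_n` default, and the small variance floor are illustrative assumptions; the cohort scores are taken as already computed.

```python
from statistics import mean, pstdev

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_n=3):
    """Normalize a raw trial score with adaptive S-norm.

    enroll_cohort_scores: scores of the enrollment embedding against the cohort.
    test_cohort_scores:   scores of the test embedding against the cohort.
    Only the top_n highest (most similar) cohort scores on each side are used.
    """
    def top_stats(cohort):
        top = sorted(cohort, reverse=True)[:top_n]
        # Floor the deviation to avoid division by zero (illustrative choice).
        return mean(top), max(pstdev(top), 1e-8)

    mu_e, sd_e = top_stats(enroll_cohort_scores)
    mu_t, sd_t = top_stats(test_cohort_scores)

    # Symmetric normalization: average of the two z-scored versions.
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)


cohort = [0.1, 0.2, 0.3, 0.4, 0.5]
print(as_norm(0.8, cohort, cohort))
```

Because the statistics come only from the most similar impostors, the normalization adapts to each trial; the abstract's point is that the cohort itself is fixed here, whereas TAS-norm learns it.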
Submitted 6 April, 2025;
originally announced April 2025.
-
Data-Driven Hamiltonian for Direct Construction of Safe Set from Trajectory Data
Authors:
Jason J. Choi,
Christopher A. Strong,
Koushil Sreenath,
Namhoon Cho,
Claire J. Tomlin
Abstract:
In continuous-time optimal control, evaluating the Hamiltonian requires solving a constrained optimization problem using the system's dynamics model. Hamilton-Jacobi reachability analysis for safety verification has demonstrated practical utility only when efficient evaluation of the Hamiltonian over a large state-time grid is possible. In this study, we introduce the concept of a data-driven Hamiltonian (DDH), which circumvents the need for an explicit dynamics model by relying only on mild prior knowledge (e.g., Lipschitz constants), thus enabling the construction of reachable sets directly from trajectory data. Recognizing that the Hamiltonian is the optimal inner product between a given costate and realizable state velocities, the DDH estimates the Hamiltonian using the worst-case realization of the velocity field based on the observed state trajectory data. This formulation ensures a conservative approximation of the true Hamiltonian for uncertain dynamics. The reachable set computed based on the DDH is also ensured to be a conservative approximation of the true reachable set. Next, we propose a data-efficient safe experiment framework for gradual expansion of safe sets using the DDH. This is achieved by iteratively conducting experiments within the computed data-driven safe set and updating the set using newly collected trajectory data. To demonstrate the capabilities of our approach, we showcase its effectiveness in safe flight envelope expansion for a tiltrotor vehicle transitioning from near-hover to forward flight.
Submitted 4 April, 2025;
originally announced April 2025.
-
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
Authors:
Kim Sung-Bin,
Jeongsoo Choi,
Puyuan Peng,
Joon Son Chung,
Tae-Hyun Oh,
David Harwath
Abstract:
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
Submitted 3 April, 2025;
originally announced April 2025.
-
Robust Transmission Design for Active RIS-Aided Systems
Authors:
Jinho Yang,
Hyeongtaek Lee,
Junil Choi
Abstract:
Different from conventional passive reconfigurable intelligent surfaces (RISs), incident signals and thermal noise can be amplified at active RISs. By exploiting the amplifying capability of active RISs, noticeable performance improvement can be expected when precise channel state information (CSI) is available. Since obtaining perfect CSI related to an RIS is difficult in practice, a robust transmission design is proposed in this paper to tackle the channel uncertainty issue, which will be more severe for active RIS-aided systems. To account for the worst-case scenario, the minimum achievable rate of each user is derived under a statistical CSI error model. Subsequently, an optimization problem is formulated to maximize the sum of the minimum achievable rates. Since the objective function is non-concave, the formulated problem is transformed into a tractable lower bound maximization problem, which is solved using an alternating optimization method. Numerical results show that the proposed robust design outperforms a baseline scheme that only exploits estimated CSI.
Submitted 31 March, 2025;
originally announced April 2025.
-
Semantic Packet Aggregation and Repeated Transmission for Text-to-Image Generation
Authors:
Seunghun Lee,
Jihong Park,
Jinho Choi,
Hyuncheol Park
Abstract:
Text-based communication is expected to be prevalent in 6G applications such as wireless AI-generated content (AIGC). Motivated by this, this paper addresses the challenges of transmitting text prompts over erasure channels for a text-to-image AIGC task by developing the semantic segmentation and repeated transmission (SMART) algorithm. SMART groups words in text prompts into packets, prioritizing the task-specific significance of semantics within these packets, and optimizes the number of repeated transmissions. Simulation results show that SMART achieves higher similarities in received texts and generated images compared to a character-level packetization baseline, while reducing computing latency by orders of magnitude compared to an exhaustive search baseline.
Submitted 31 March, 2025;
originally announced March 2025.
-
A Self-Supervised Learning of a Foundation Model for Analog Layout Design Automation
Authors:
Sungyu Jeong,
Won Joon Choi,
Junung Choi,
Anik Biswas,
Byungsub Kim
Abstract:
We propose a UNet-based foundation model and its self-supervised learning method to address two key challenges: 1) lack of qualified annotated analog layout data, and 2) excessive variety in analog layout design tasks. For self-supervised learning, we propose random patch sampling and random masking techniques to automatically obtain enough training data from a small unannotated layout dataset. The obtained data are greatly augmented, less biased, equally sized, and contain enough information for excessive varieties of qualified layout patterns. By pre-training with the obtained data, the proposed foundation model can learn implicit general knowledge on layout patterns so that it can be fine-tuned for various downstream layout tasks with small task-specific datasets. Fine-tuning provides an efficient and consolidated methodology for diverse downstream tasks, reducing the enormous human effort to develop a model per task separately. In experiments, the foundation model was pre-trained using 324,000 samples obtained from 6 silicon-proven, manually designed analog circuits, and then fine-tuned for five example downstream tasks: generating contacts, vias, dummy fingers, N-wells, and metal routings. The fine-tuned models successfully performed these tasks for more than one thousand unseen layout inputs, generating DRC/LVS-clean layouts for 96.6% of samples. Compared with training the model from scratch for the metal routing task, fine-tuning required only 1/8 of the data to achieve the same Dice score of 0.95. With the same data, fine-tuning achieved a 90% lower validation loss and a 40% higher benchmark score than training from scratch.
Submitted 28 March, 2025;
originally announced March 2025.
-
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Authors:
Ji-Hoon Kim,
Jeongsoo Choi,
Jaehun Kim,
Chaeyoung Jung,
Joon Son Chung
Abstract:
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure a seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
Submitted 21 March, 2025;
originally announced March 2025.
-
PD-Skygroundhook Controller for Semi-Active Suspension System Using Magnetorheological Fluid Dampers
Authors:
Hansol Lim,
Jee Won Lee,
Seung-Bok Choi,
Jongseong Brad Choi
Abstract:
This paper presents a Proportional-Derivative (PD) Skygroundhook controller for magnetorheological (MR) dampers in semi-active suspensions. Traditional Skyhook, Groundhook, and hybrid Skygroundhook controllers are well known for their ability to reduce body and wheel vibrations; however, each approach has limitations in handling a broad frequency spectrum and often relies on abrupt switching. By adding a derivative action to the classical Skygroundhook logic, the proposed PD-Skygroundhook method enhances high-frequency damping and stabilizes transition behaviors. By leveraging the fast response of MR dampers, our controller adjusts the damper force continuously in real time to match the desired damping force of the PD-Skygroundhook controller with efficient computation. Experimental evaluations under bump excitations and sine-sweep tests demonstrate a significant reduction in sprung mass acceleration and unsprung mass acceleration, outperforming the standard Skygroundhook in both ride comfort and road handling. These results highlight that the derivative action effectively reduces resonance peaks and smooths out the force transitions of the regular Skygroundhook. Our method offers a robust alternative to more computationally demanding semi-active controllers.
Submitted 17 March, 2025;
originally announced March 2025.
-
MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
Authors:
Sungwoo Cho,
Jeongsoo Choi,
Sungnyun Kim,
Se-Young Yun
Abstract:
Despite recent advances in text-to-speech (TTS) models, audio-visual to audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multi-modal guidance with CFM, our model robustly preserves speaker-specific characteristics and significantly enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements.
Submitted 13 March, 2025;
originally announced March 2025.
-
A New Interpretation of the Time-Interleaved ADC Mismatch Problem: A Tracking-Based Hybrid Calibration Approach
Authors:
Jiwon Sung,
Jinseok Choi
Abstract:
Time-interleaved ADCs (TI-ADCs) achieve high sampling rates by interleaving multiple sub-ADCs in parallel. Mismatch errors between the sub-ADCs, however, can significantly degrade the signal quality, which is a main performance bottleneck. This paper presents a hybrid calibration approach by interpreting the mismatch problem as a tracking problem, and uses the extended Kalman filter for online estimation and compensation of the mismatch errors. After estimation, the desired signal is reconstructed using a truncated fractional delay filter and a high-pass filter. Simulations demonstrate that our algorithm substantially outperforms the existing hybrid calibration method in both mismatch estimation and compensation.
Submitted 13 March, 2025;
originally announced March 2025.
-
SE(3)-Equivariant Robot Learning and Control: A Tutorial Survey
Authors:
Joohwan Seo,
Soochul Yoo,
Junwoo Chang,
Hyunseok An,
Hyunwoo Ryu,
Soomi Lee,
Arvind Kruthiventy,
Jongeun Choi,
Roberto Horowitz
Abstract:
Recent advances in deep learning and Transformers have driven major breakthroughs in robotics by employing techniques such as imitation learning, reinforcement learning, and LLM-based multimodal perception and decision-making. However, conventional deep learning and Transformer models often struggle to process data with inherent symmetries and invariances, typically relying on large datasets or extensive data augmentation. Equivariant neural networks overcome these limitations by explicitly integrating symmetry and invariance into their architectures, leading to improved efficiency and generalization. This tutorial survey reviews a wide range of equivariant deep learning and control methods for robotics, from classic to state-of-the-art, with a focus on SE(3)-equivariant models that leverage the natural 3D rotational and translational symmetries in visual robotic manipulation and control design. Using unified mathematical notation, we begin by reviewing key concepts from group theory, along with matrix Lie groups and Lie algebras. We then introduce foundational group-equivariant neural network design and show how the group-equivariance can be obtained through their structure. Next, we discuss the applications of SE(3)-equivariant neural networks in robotics in terms of imitation learning and reinforcement learning. The SE(3)-equivariant control design is also reviewed from the perspective of geometric control. Finally, we highlight the challenges and future directions of equivariant methods in developing more robust, sample-efficient, and multi-modal real-world robotic systems.
Submitted 23 April, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Machine Learning for Future Wireless Communications: Channel Prediction Perspectives
Authors:
Hwanjin Kim,
Junil Choi,
David J. Love
Abstract:
Precise channel state knowledge is crucial in future wireless communication systems, which drives the need for accurate channel prediction without additional pilot overhead. While machine-learning (ML) methods for channel prediction show potential, existing approaches have limitations in their capability to adapt to environmental changes due to their extensive training requirements. In this paper, we introduce the channel prediction approaches in terms of the temporal channel prediction and the environmental adaptation. Then, we elaborate on the use of the advanced ML-based channel prediction to resolve the issues in traditional ML methods. The numerical results show that the advanced ML-based channel prediction has comparable accuracy with much less training overhead compared to conventional prediction methods. Also, we examine the training process, dataset characteristics, and the impact of source tasks and pre-trained models on channel prediction approaches. Finally, we discuss open challenges and possible future research directions of ML-based channel prediction.
Submitted 25 February, 2025;
originally announced February 2025.
-
ScNeuGM: Scalable Neural Graph Modeling for Coloring-Based Contention and Interference Management in Wi-Fi 7
Authors:
Zhouyou Gu,
Jihong Park,
Jinho Choi
Abstract:
Carrier-sense multiple access with collision avoidance in Wi-Fi often leads to contention and interference, thereby increasing packet losses. These challenges have traditionally been modeled as a graph, with stations (STAs) represented as vertices and contention or interference as edges. Graph coloring assigns orthogonal transmission slots to STAs, managing contention and interference, e.g., using the restricted target wake time (RTWT) mechanism introduced in Wi-Fi 7 standards. However, legacy graph models lack flexibility in optimizing these assignments, often failing to minimize slot usage while maintaining reliable transmissions. To address this issue, we propose ScNeuGM, a neural graph modeling (NGM) framework that flexibly trains a neural network (NN) to construct optimal graph models whose coloring corresponds to optimal slot assignments. ScNeuGM is highly scalable to large Wi-Fi networks with massive STA pairs: 1) it utilizes an evolution strategy (ES) to directly optimize the NN parameters based on one network-wise reward signal, avoiding exhaustive edge-wise feedback estimations in all STA pairs; 2) ScNeuGM also leverages a deep hashing function (DHF) to group contending or interfering STA pairs and restricts NGM NN training and inference to pairs within these groups, significantly reducing complexity. Simulations show that the ES-trained NN in ScNeuGM returns near-optimal graphs 4-10 times more often than algorithms requiring edge-wise feedback and uses 25% fewer slots than legacy graph constructions. Furthermore, the DHF in ScNeuGM reduces the training and inference time of NGM by 4 and 8 times, respectively, and the online slot assignment time by 3 times in large networks, and yields up to 30% fewer packet losses in dynamic scenarios due to the timely assignments.
Submitted 5 February, 2025;
originally announced February 2025.
-
Meta-Learning-Based People Counting and Localization Models Employing CSI from Commodity WiFi NICs
Authors:
Jihoon Cha,
Hwanjin Kim,
Junil Choi
Abstract:
In this paper, we consider people counting and localization systems exploiting channel state information (CSI) measured from commodity WiFi network interface cards (NICs). While CSI carries useful amplitude and phase information describing signal propagation in a measurement environment of interest, CSI measurements suffer from offsets due to various uncertainties. Moreover, an uncontrollable external environment where other WiFi devices communicate with each other induces interfering signals, resulting in erroneous CSI captured at a receiver. We first propose CSI preprocessing for offset removal, which guarantees low-latency operation without any filtering process. Afterwards, we design people counting and localization models based on pre-training. To be adaptive to different measurement environments, meta-learning-based people counting and localization models are also proposed. Numerical results show that the proposed meta-learning-based people counting and localization models can achieve high sensing accuracy compared to other learning schemes that follow simple training and test procedures.
Submitted 5 February, 2025;
originally announced February 2025.
-
Multi-Modal Variable-Rate CSI Reconstruction for FDD Massive MIMO Systems
Authors:
Yunseo Nam,
Jiwook Choi
Abstract:
In frequency division duplex (FDD) systems, acquiring channel state information (CSI) at the base station (BS) traditionally relies on limited feedback from mobile terminals (MTs). However, the accuracy of channel reconstruction from feedback CSI is inherently constrained by the rate-distortion trade-off. To overcome this limitation, we propose a multi-modal channel reconstruction framework that leverages auxiliary data, such as RGB images or uplink CSI, collected at the BS. By integrating contextual information from these modalities, the framework mitigates CSI distortions caused by noise, compression, and quantization. At its core, the framework utilizes an autoencoder network capable of generating variable-length CSI, tailored for rate-adaptive multi-modal channel reconstruction. By augmenting the foundational autoencoder network using a transfer learning-based multi-modal fusion strategy, we enable accurate channel reconstruction in both single-modal and multi-modal scenarios. To train and evaluate the network under diverse and realistic wireless conditions, we construct a synthetic dataset that pairs wireless channel data with sensor data through 3D modeling and ray tracing. Simulation results demonstrate that the proposed framework achieves near-optimal beamforming gains in 5G New Radio (5G NR)-compliant scenarios, highlighting the potential of sensor data integration to improve CSI reconstruction accuracy.
Submitted 7 March, 2025; v1 submitted 21 January, 2025;
originally announced January 2025.
-
SIG-SDP: Sparse Interference Graph-Aided Semidefinite Programming for Large-Scale Wireless Time-Sensitive Networking
Authors:
Zhouyou Gu,
Jihong Park,
Branka Vucetic,
Jinho Choi
Abstract:
Wireless time-sensitive networking (WTSN) is essential for the Industrial Internet of Things. We address the problem of minimizing time slots needed for WTSN transmissions while ensuring reliability subject to interference constraints -- an NP-hard task. Existing semidefinite programming (SDP) methods can relax and solve the problem but suffer from high polynomial complexity. We propose a sparse interference graph-aided SDP (SIG-SDP) framework that exploits the interference's sparsity arising from attenuated signals between distant user pairs. First, the framework utilizes the sparsity to establish the upper and lower bounds of the minimum number of slots and uses binary search to locate the minimum within the bounds. Here, for each searched slot number, the framework optimizes a positive semidefinite (PSD) matrix indicating how likely user pairs share the same slot, and the constraint feasibility with the optimized PSD matrix further refines the slot search range. Second, the framework designs a matrix multiplicative weights (MMW) algorithm that accelerates the optimization, achieved by only sparsely adjusting interfering user pairs' elements in the PSD matrix while skipping the non-interfering pairs. We also design an online architecture to deploy the framework to adjust slot assignments based on real-time interference measurements. Simulations show that the SIG-SDP framework converges in near-linear complexity and is highly scalable to large networks. The framework minimizes the number of slots with up to 10 times faster computation and up to 100 times lower packet loss rates than the compared methods. The online architecture demonstrates how the algorithm complexity impacts dynamic networks' performance.
Submitted 20 January, 2025;
originally announced January 2025.
-
Blind Training for Channel-Adaptive Digital Semantic Communications
Authors:
Yongjeong Oh,
Joohyuk Park,
Jinho Choi,
Jihong Park,
Yo-Seb Jeon
Abstract:
Semantic encoders and decoders for digital semantic communication (SC) often struggle to adapt to variations in unpredictable channel environments and diverse system designs. To address these challenges, this paper proposes a novel framework for training semantic encoders and decoders to enable channel-adaptive digital SC. The core idea is to use a binary symmetric channel (BSC) as a universal representation of generic digital communications, eliminating the need to specify channel environments or system designs. Based on this idea, our framework employs parallel BSCs to equivalently model the relationship between the encoder's output and the decoder's input. The bit-flip probabilities of these BSCs are treated as trainable parameters during end-to-end training, with varying levels of regularization applied to address diverse requirements in practical systems. The advantage of our framework is justified by developing a training-aware communication strategy for the inference stage. This strategy makes communication bit errors align with the pre-trained bit-flip probabilities by adaptively selecting power and modulation levels based on practical requirements and channel conditions. Simulation results demonstrate that the proposed framework outperforms existing training approaches in terms of both task performance and power consumption.
Submitted 19 March, 2025; v1 submitted 4 January, 2025;
originally announced January 2025.
-
Scalable Beamforming Design for Multi-RIS-Aided MU-MIMO Systems with Imperfect CSIT
Authors:
Mintaek Oh,
Jinseok Choi
Abstract:
A reconfigurable intelligent surface (RIS) has emerged as a promising solution for enhancing the capabilities of wireless communications. This paper presents a scalable beamforming design for maximizing the spectral efficiency (SE) of multi-RIS-aided communications through joint optimization of the precoder and RIS phase shifts in multi-user multiple-input multiple-output (MU-MIMO) systems under imperfect channel state information at the transmitter (CSIT). To address key challenges of the joint optimization problem, we first decompose it into two subproblems by deriving a proper lower bound. We then leverage a generalized power iteration (GPI) approach to identify a superior local optimal precoding solution. We further extend this approach to the RIS design using regularization: we set an RIS regularization function to efficiently handle the unit-modulus constraints, and also find the superior local optimal solution for RIS phase shifts under the GPI-based optimization framework. Subsequently, we propose an alternating optimization method. In particular, by utilizing the block-diagonal structure of the matrices in the GPI method, the proposed algorithm offers multi-RIS scalable beamforming as well as superior SE performance. Simulations validate the proposed method in terms of both the sum SE performance and the scalability.
Submitted 1 January, 2025;
originally announced January 2025.
-
A Selective Secure Precoding Framework for MU-MIMO Rate-Splitting Multiple Access Networks Under Limited CSIT
Authors:
Sangmin Lee,
Seokjun Park,
Jeonghun Park,
Jinseok Choi
Abstract:
In this paper, we propose a robust and adaptable secure precoding framework designed to encapsulate an intricate scenario where legitimate users have different levels of information security: secure private or normal public information. Leveraging rate-splitting multiple access (RSMA), we formulate the sum secrecy spectral efficiency (SE) maximization problem in downlink multi-user multiple-input multiple-output (MIMO) systems with multiple eavesdroppers. To resolve the challenges including the heterogeneity of security, non-convexity, and non-smoothness of the problem, we initially approximate the problem using a LogSumExp technique. Subsequently, we derive the first-order optimality condition in the form of a generalized eigenvalue problem. We utilize a power iteration-based method to solve the condition, thereby achieving a superior local optimal solution. The proposed algorithm is further extended to a more realistic scenario involving limited channel state information at the transmitter (CSIT). To effectively utilize the limited channel information, we employ a conditional average rate approach. Handling the conditional average by deriving useful bounds, we establish a lower bound for the objective function under the conditional average. We then apply an optimization method similar to that used for the perfect CSIT case. In simulations, we validate the proposed algorithm in terms of the sum secrecy SE.
Submitted 26 December, 2024;
originally announced December 2024.
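As a rough illustration of the LogSumExp smoothing mentioned in the abstract above, the sketch below replaces a non-smooth minimum over several rates with a differentiable surrogate. It is not the authors' formulation: the function name `smooth_min`, the parameter `alpha`, and the sample rates are assumptions chosen purely for illustration.

```python
import math

def smooth_min(values, alpha):
    """Differentiable LogSumExp surrogate for min(values).

    Computes -(1/alpha) * log(sum(exp(-alpha * v))), which lies in
    [min(values) - log(n)/alpha, min(values)] and tightens as alpha grows.
    """
    m = min(values)
    # Shift by the minimum so every exponent is <= 0 (numerical stability).
    total = sum(math.exp(-alpha * (v - m)) for v in values)
    return m - math.log(total) / alpha

rates = [1.2, 0.7, 2.5]  # hypothetical per-user rates
for alpha in (1.0, 10.0, 100.0):
    print(alpha, smooth_min(rates, alpha))
```

A larger `alpha` gives a tighter approximation of the true minimum at the cost of a less smooth surrogate, which is the usual trade-off when tuning this kind of relaxation.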
-
Integrated Sensing and Communications in Downlink FDD MIMO without CSI Feedback
Authors:
Namhyun Kim,
Juntaek Han,
Jinseok Choi,
Ahmed Alkhateeb,
Chan-Byoung Chae,
Jeonghun Park
Abstract:
In this paper, we propose a precoding framework for frequency division duplex (FDD) integrated sensing and communication (ISAC) systems with multiple-input multiple-output (MIMO). Specifically, we aim to maximize ergodic sum spectral efficiency (SE) while satisfying a sensing beam pattern constraint defined by the mean squared error (MSE). Our method reconstructs downlink (DL) channel state information (CSI) from uplink (UL) training signals using partial reciprocity, eliminating the need for CSI feedback. To obtain the error covariance matrix of the reconstructed DL CSI, we devise an observed Fisher information-based estimation technique. Leveraging this, to mitigate interference caused by imperfect DL CSI reconstruction and sensing operations, we propose a rate-splitting multiple access (RSMA) aided precoder optimization method. This method jointly updates the precoding vector and Lagrange multipliers by solving the nonlinear eigenvalue problem with eigenvector dependency to maximize SE. The numerical results show that the proposed design achieves precise beam pattern control, maximizes SE, and significantly improves the sensing-communication trade-off compared to the state-of-the-art methods in FDD ISAC scenarios.
Submitted 10 June, 2025; v1 submitted 17 December, 2024;
originally announced December 2024.
-
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
Authors:
Jeongsoo Choi,
Ji-Hoon Kim,
Jinyu Li,
Joon Son Chung,
Shujie Liu
Abstract:
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances. Code and models are available at: https://github.com/kaistmm/V2SFlow
Submitted 30 May, 2025; v1 submitted 29 November, 2024;
originally announced November 2024.
-
An Experimental Multi-Band Channel Characterization in the Upper Mid-Band
Authors:
Roberto Bomfin,
Ahmad Bazzi,
Hao Guo,
Hyeongtaek Lee,
Marco Mezzavilla,
Sundeep Rangan,
Junil Choi,
Marwa Chafii
Abstract:
This paper provides a multi-band channel measurement analysis in frequency range 3 (FR3). The study focuses on the lower FR3 frequencies of 6.5 GHz and 8.75 GHz, with a setup tailored to integrated sensing and communication (ISAC), in which data are collected with and without a target present. A method based on multiple signal classification (MUSIC) is used to refine the delays of the channel impulse response estimates. The results reveal that the channel at the lower frequency, 6.5 GHz, has additional distinguishable multipath components in the presence of the target, while the channel at the higher frequency, 8.75 GHz, experiences more blockage. The set of results reported in this paper serves as a benchmark for future multi-band studies in the FR3 spectrum.
Submitted 19 November, 2024;
originally announced November 2024.
-
Unsupervised Training of a Dynamic Context-Aware Deep Denoising Framework for Low-Dose Fluoroscopic Imaging
Authors:
Sun-Young Jeon,
Sen Wang,
Adam S. Wang,
Garry E. Gold,
Jang-Hwan Choi
Abstract:
Fluoroscopy is critical for real-time X-ray visualization in medical imaging. However, low-dose images are compromised by noise, potentially affecting diagnostic accuracy. Noise reduction is crucial for maintaining image quality, especially given such challenges as motion artifacts and the limited availability of clean data in medical imaging. To address these issues, we propose an unsupervised training framework for dynamic context-aware denoising of fluoroscopy image sequences. First, we train the multi-scale recurrent attention U-Net (MSR2AU-Net) without requiring clean data to address the initial noise. Second, we incorporate a knowledge distillation-based uncorrelated noise suppression module and a recursive filtering-based correlated noise suppression module enhanced with motion compensation to further improve motion compensation and achieve superior denoising performance. Finally, we introduce a novel approach by combining these modules with a pixel-wise dynamic object motion cross-fusion matrix, designed to adapt to motion, and an edge-preserving loss for precise detail retention. To validate the proposed method, we conducted extensive numerical experiments on medical image datasets, including 3,500 fluoroscopy images from dynamic phantoms (2,400 images for training, 1,100 for testing) and 350 clinical images from a spinal surgery patient. Moreover, we demonstrated the robustness of our approach across different imaging modalities by testing it on the publicly available 2016 Low Dose CT Grand Challenge dataset, using 4,800 images for training and 1,136 for testing. The results demonstrate that the proposed approach outperforms state-of-the-art unsupervised algorithms in both visual quality and quantitative evaluation while achieving comparable performance to well-established supervised learning methods across low-dose fluoroscopy and CT imaging.
Submitted 29 October, 2024;
originally announced November 2024.
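For readers unfamiliar with the recursive filtering referred to in the abstract above, the following is a generic first-order temporal recursive filter. It is only a textbook sketch, without the motion compensation or learned components the abstract describes; the function name `recursive_filter`, the weight `alpha`, and the toy frames are assumptions.

```python
def recursive_filter(frames, alpha):
    """First-order temporal recursive filter over a sequence of frames.

    Each frame is a flat list of pixel values. The output at time t is
    alpha * frame_t + (1 - alpha) * output_{t-1}; a smaller alpha gives
    stronger temporal smoothing (and more blur if the scene moves).
    """
    outputs = []
    previous = None
    for frame in frames:
        if previous is None:
            current = list(frame)  # first frame passes through unchanged
        else:
            current = [alpha * x + (1.0 - alpha) * p
                       for x, p in zip(frame, previous)]
        outputs.append(current)
        previous = current
    return outputs

noisy = [[10.0, 20.0], [14.0, 16.0], [10.0, 20.0]]  # hypothetical 2-pixel frames
print(recursive_filter(noisy, alpha=0.5))
```

Because each output reuses the previous output, noise that is uncorrelated across frames is averaged down, while moving objects leave trails unless the previous frame is first aligned to the current one.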
-
Channel-Coded Precoding for Multi-User MISO Systems
Authors:
Ly V. Nguyen,
Junil Choi,
Bjorn Ottersten,
A. Lee Swindlehurst
Abstract:
Precoding is a critical and long-standing technique in multi-user communication systems. However, the majority of existing precoding methods do not consider channel coding in their designs. In this paper, we consider the precoding problem in multi-user multiple-input single-output (MISO) systems, incorporating channel coding into the design. By leveraging the error-correcting capability of channel codes, we increase the degrees of freedom in the transmit signal design, thereby enhancing the overall system performance. We first propose a novel data-dependent precoding framework for coded MISO systems, referred to as channel-coded precoding (CCP), which maximizes the probability that information bits can be correctly recovered by the channel decoder. The proposed CCP framework allows the transmit signals to produce data symbol errors at the users' receivers, as long as the overall information BER performance is improved. We develop the CCP framework for both one-bit and multi-bit error-correcting capacity and devise a projected gradient-based approach to solve the design problem. We also develop a robust CCP framework for the case where perfect channel state information (CSI) is unavailable at the transmitter, taking into account the effect of both noise and channel estimation errors. Finally, we conduct numerous simulations to verify the effectiveness of the proposed CCP and its superiority compared to existing precoding methods, and we identify situations where the proposed CCP yields the most significant gains.
Submitted 29 October, 2024;
originally announced October 2024.
-
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
Authors:
Tan Dat Nguyen,
Ji-Hoon Kim,
Jeongsoo Choi,
Shukjae Choi,
Jinseok Park,
Younglo Lee,
Joon Son Chung
Abstract:
The goal of this paper is to accelerate codec-based speech synthesis systems with minimum sacrifice to speech quality. We propose an enhanced inference method that allows for flexible trade-offs between speed and quality during inference without requiring additional training. Our core idea is to predict multiple tokens per inference step of the AR module using multiple prediction heads, resulting in a linear reduction in synthesis time as the number of heads increases. Furthermore, we introduce a novel speculative decoding technique that utilises a Viterbi-based algorithm to select the optimal sequence of generated tokens at each decoding step. In our experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models, with minimal quality trade-off or even improvement in terms of speech intelligibility. Audio samples are available at: multpletokensprediction.github.io/multipletokensprediction.github.io/.
Submitted 17 October, 2024;
originally announced October 2024.
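The linear reduction in synthesis time described in the abstract above follows from simple counting: if each autoregressive pass emits several tokens, fewer passes are needed. The sketch below is an idealized count only; it ignores the cost of the extra heads and of the speculative-decoding selection step, and the function name `ar_steps` and the token counts are assumptions.

```python
import math

def ar_steps(num_tokens, num_heads):
    """Number of autoregressive forward passes needed to emit num_tokens
    when each pass predicts num_heads tokens at once."""
    if num_heads < 1:
        raise ValueError("num_heads must be at least 1")
    return math.ceil(num_tokens / num_heads)

for heads in (1, 2, 4):
    print(heads, ar_steps(600, heads))
```

Under these assumptions, moving from one head to four cuts the pass count for a 600-token utterance to a quarter; measured speedups depend on per-pass overhead and on how many predicted tokens are actually accepted.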
-
Adaptive Power Allocation in Spaceborne Assisted NOMA Systems for Integrated Terrestrial Communications
Authors:
M Khalil,
Ke Wang,
Jinho Choi
Abstract:
This study introduces an innovative approach for adaptive power allocation in Non-Orthogonal Multiple Access (NOMA) systems, enhanced by the integration of spaceborne and terrestrial signals through a Reconfigurable Intelligent Surface (RIS). We develop an adaptive mechanism to adjust the power distribution between spaceborne and terrestrial signals according to variations in environmental conditions and elevation angles. This mechanism employs a sophisticated transition model that combines Gaussian Mixture Models with Log-Normal distributions to adaptively counteract the detrimental impacts of atmospheric attenuation and urban shadowing. These adaptive power adjustments significantly enhance system capacity, particularly improving the Signal-to-Interference-plus-Noise Ratio under diverse operational scenarios. Simulation studies confirm the efficacy of our method within an RIS-enhanced framework, showing an approximate 20% increase in system capacity through optimized power management between spaceborne and terrestrial signals.
Submitted 15 October, 2024;
originally announced October 2024.
-
Improved PCRLB for radar tracking in clutter with geometry-dependent target measurement uncertainty and application to radar trajectory control
Authors:
Yifang Shi,
Yu Zhang,
Linjiao Fu,
Dongliang Peng,
Qiang Lu,
Jee Woong Choi,
Alfonso Farina
Abstract:
In realistic radar tracking, target measurement uncertainty (TMU), in terms of both detection probability and measurement error covariance, is significantly affected by the target-to-radar (T2R) geometry. However, existing posterior Cramér-Rao lower bounds (PCRLBs) rarely investigate the fundamental impact of T2R geometry on target measurement uncertainty, and eventually on the mean square error (MSE) of the state estimate, inevitably resulting in an over-conservative lower bound. To address this issue, this paper first derives a generalized model of the target measurement error covariance for bistatic radar with a moving receiver and a transmitter illuminating any type of signal, along with its approximated solution, to specify the impact of T2R geometry on the error covariance. Based on the formulated TMU model, an improved PCRLB (IPCRLB) that fully accounts for both measurement origin uncertainty and geometry-dependent TMU is then re-derived, in which both detection probability and measurement error covariance are treated as state-dependent parameters when differentiating the log-likelihood with respect to the target state. Compared to existing PCRLBs that partially or completely ignore the dependence of target measurement uncertainty on T2R geometry, the proposed IPCRLB provides a much more accurate (less conservative) lower bound for radar tracking in clutter with geometry-dependent TMU. The new bound is then applied to radar trajectory control to effectively optimize the T2R geometry, and it exhibits the least uncertainty in the acquired target measurements and more accurate state estimates for bistatic radar tracking in clutter, compared to state-of-the-art trajectory control methods.
Submitted 8 October, 2024;
originally announced October 2024.
-
RIS-Enabled Cellular Systems Operated by Different Service Providers
Authors:
Hyeongtaek Lee,
Junil Choi
Abstract:
In realistic cellular communication systems, multiple service providers will operate within different frequency ranges. Each serving cell, which is managed by a distinct service provider, is designed individually due to the orthogonal frequencies. However, when a reconfigurable intelligent surface (RIS) is deployed for a certain cell, the RIS still incurs reflective channels for the overall system since the RIS reflects signals across all frequency ranges. This may cause severe undesired performance degradation for the other cells unless the reflection coefficients are properly designed. To tackle this issue, this paper proposes an RIS reflection coefficient design based on the Riemannian manifold optimization method that maximizes the performance improvement of the cell deploying the RIS while simultaneously minimizing the undesired performance degradation for the other cells. Numerical results demonstrate that the proposed design can effectively balance the two objectives for practical scenarios.
Submitted 27 September, 2024;
originally announced September 2024.
-
Predictive Covert Communication Against Multi-UAV Surveillance Using Graph Koopman Autoencoder
Authors:
Sivaram Krishnan,
Jihong Park,
Gregory Sherman,
Benjamin Campbell,
Jinho Choi
Abstract:
Low Probability of Detection (LPD) communication aims to obscure the presence of radio frequency (RF) signals to evade surveillance. In the context of mobile surveillance utilizing unmanned aerial vehicles (UAVs), achieving LPD communication presents significant challenges due to the UAVs' rapid and continuous movements, which are characterized by unknown nonlinear dynamics. Therefore, accurately predicting future locations of UAVs is essential for enabling real-time LPD communication. In this paper, we introduce a novel framework termed predictive covert communication, aimed at minimizing detectability in terrestrial ad-hoc networks under multi-UAV surveillance. Our data-driven method integrates graph neural networks (GNN) with Koopman theory to model the complex interactions within a multi-UAV network and to facilitate long-term predictions by linearizing the dynamics, even with limited historical data. Extensive simulation results substantiate that the predicted trajectories using our method result in at least 63%-75% lower probability of detection when compared to well-known state-of-the-art baseline approaches, showing promise in enabling low-latency covert operations in practical scenarios.
Submitted 25 September, 2024;
originally announced September 2024.
-
Gait Switching and Enhanced Stabilization of Walking Robots with Deep Learning-based Reachability: A Case Study on Two-link Walker
Authors:
Xingpeng Xia,
Jason J. Choi,
Ayush Agrawal,
Koushil Sreenath,
Claire J. Tomlin,
Somil Bansal
Abstract:
Learning-based approaches have recently shown notable success in legged locomotion. However, these approaches often lack accountability, necessitating empirical tests to determine their effectiveness. In this work, we are interested in designing a learning-based locomotion controller whose stability can be examined and guaranteed. This can be achieved by verifying regions of attraction (RoAs) of legged robots to their stable walking gaits. This is a non-trivial problem for legged robots due to their hybrid dynamics. Although previous work has shown the utility of Hamilton-Jacobi (HJ) reachability to solve this problem, its practicality was limited by its poor scalability. The core contribution of our work is the application of a deep learning-based HJ reachability solution to the hybrid legged robot dynamics, which overcomes this limitation of previous work. With the learned reachability solution, first, we can estimate a library of RoAs for various gaits. Second, we can design a one-step predictive controller that effectively stabilizes to an individual gait within the verified RoA. Finally, we can devise a strategy that switches gaits in response to external perturbations, with feasibility guided by the RoA analysis. We demonstrate our method in a two-link walker simulation, whose mathematical model is well established. Our method achieves better stability than previous model-based methods, while ensuring transparency that was not present in existing learning-based approaches.
Submitted 10 September, 2024;
originally announced September 2024.
-
LiDAR-3DGS: LiDAR Reinforced 3D Gaussian Splatting for Multimodal Radiance Field Rendering
Authors:
Hansol Lim,
Hanbeom Chang,
Jongseong Brad Choi,
Chul Min Yeum
Abstract:
In this paper, we explore the capabilities of multimodal inputs to 3D Gaussian Splatting (3DGS) based Radiance Field Rendering. We present LiDAR-3DGS, a novel method of reinforcing 3DGS inputs with LiDAR-generated point clouds to significantly improve the accuracy and detail of 3D models. We demonstrate a systematic approach of LiDAR reinforcement to 3DGS that enables the capture of important features such as bolts, apertures, and other details that are often missed by image-based features alone. These details are crucial for engineering applications such as remote monitoring and maintenance. Without modifying the underlying 3DGS algorithm, we demonstrate that even a modest addition of LiDAR-generated point clouds significantly enhances the perceptual quality of the models. At 30k iterations, the model generated by our method showed increases of 7.064% in PSNR and 0.565% in SSIM. Since the LiDAR used in this research was a commonly used commercial-grade device, the improvements observed were modest and can be further enhanced with higher-grade LiDAR systems. Additionally, these improvements can be supplementary to other derivative works of Radiance Field Rendering and also provide new insight for future LiDAR and computer vision integrated modeling.
Submitted 9 September, 2024;
originally announced September 2024.
-
NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers
Authors:
Nohil Park,
Heeseung Kim,
Che Hyun Lee,
Jooyoung Choi,
Jiheum Yeom,
Sungroh Yoon
Abstract:
We present NanoVoice, a personalized text-to-speech model that efficiently constructs voice adapters for multiple speakers simultaneously. NanoVoice introduces a batch-wise speaker adaptation technique capable of fine-tuning multiple references in parallel, significantly reducing training time. Beyond building separate adapters for each speaker, we also propose a parameter sharing technique that reduces the number of parameters used for speaker adaptation. By incorporating a novel trainable scale matrix, NanoVoice mitigates potential performance degradation during parameter sharing. NanoVoice achieves performance comparable to the baselines, while training 4 times faster and using 45 percent fewer parameters for speaker adaptation with 40 reference voices. Extensive ablation studies and analysis further validate the efficiency of our model.
Submitted 20 December, 2024; v1 submitted 24 September, 2024;
originally announced September 2024.
-
VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
Authors:
Jiheum Yeom,
Heeseung Kim,
Jooyoung Choi,
Che Hyun Lee,
Nohil Park,
Sungroh Yoon
Abstract:
When applying parameter-efficient finetuning via LoRA to speaker-adaptive text-to-speech models, adaptation performance may decline compared to fully finetuned counterparts, especially for out-of-domain speakers. Here, we propose VoiceGuider, a parameter-efficient speaker-adaptive text-to-speech system reinforced with autoguidance to enhance speaker adaptation performance, reducing the gap to fully finetuned models. We carefully explore various ways of strengthening autoguidance, ultimately finding the optimal strategy. As a result, VoiceGuider shows robust adaptation performance, especially on extreme out-of-domain speech data. We provide audio samples on our demo page.
Submitted 20 December, 2024; v1 submitted 24 September, 2024;
originally announced September 2024.
-
Speech-Declipping Transformer with Complex Spectrogram and Learnable Temporal Features
Authors:
Younghoo Kwon,
Jung-Woo Choi
Abstract:
We present a transformer-based speech-declipping model that effectively recovers clipped signals across a wide range of input signal-to-distortion ratios (SDRs). While recent time-domain deep neural network (DNN)-based declippers have outperformed traditional handcrafted and spectrogram-based DNN approaches, they still struggle with low-SDR inputs. To address this, we incorporate a transformer-based architecture that operates in the time-frequency (TF) domain. The TF-transformer architecture has demonstrated remarkable performance in the speech enhancement task for low-SDR signals but may not be optimal for time-domain artifacts such as clipping. To overcome the limitations of spectrogram-based DNNs, we design an extra convolutional block that directly extracts temporal features from time-domain waveforms. The joint analysis of the complex spectrogram and learned temporal features allows the model to improve performance on both high- and low-SDR inputs. Our approach also preserves the unclipped portions of the speech signal during processing, preventing the degradation typically seen when only spectral information is used. In evaluations on the VoiceBank-DEMAND and DNS challenge datasets, the proposed model consistently outperformed state-of-the-art (SOTA) declipping models across various metrics, demonstrating its robustness and generalizability.
Submitted 18 September, 2024;
originally announced September 2024.
-
Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues
Authors:
Dayun Choi,
Jung-Woo Choi
Abstract:
We propose a multichannel-to-multichannel target sound extraction (M2M-TSE) framework for separating multichannel target signals from a multichannel mixture of sound sources. Target sound extraction (TSE) isolates a specific target signal using user-provided clues, typically focusing on single-channel extraction with class labels or temporal activation maps. However, to preserve and utilize spatial information in multichannel audio signals, it is essential to extract multichannel signals of a target sound source. Moreover, the clue for extraction can also include spatial or temporal cues such as direction-of-arrival (DoA) or timestamps of source activation. To address these challenges, we present an M2M framework that extracts a multichannel sound signal based on spatio-temporal clues. We demonstrate that our transformer-based architecture can successfully accomplish the M2M-TSE task for multichannel signals synthesized from audio signals of diverse classes in different room environments. Furthermore, we show that the multichannel extraction task introduces sufficient inductive bias in the DNN, allowing it to directly handle DoA clues without utilizing hand-crafted spatial features.
Submitted 18 September, 2024;
originally announced September 2024.
-
DeFT-Mamba: Universal Multichannel Sound Separation and Polyphonic Audio Classification
Authors:
Dongheon Lee,
Jung-Woo Choi
Abstract:
This paper presents a framework for universal sound separation and polyphonic audio classification, addressing the challenges of separating and classifying individual sound sources in a multichannel mixture. The proposed framework, DeFT-Mamba, utilizes the dense frequency-time attentive network (DeFTAN) combined with Mamba to extract sound objects, capturing the local time-frequency relations through a gated convolution block and the global time-frequency relations through a position-wise Hybrid Mamba. DeFT-Mamba surpasses existing separation and classification networks by a large margin, particularly in complex scenarios involving in-class polyphony. Additionally, a classification-based source counting method is introduced to identify the presence of multiple sources, outperforming conventional threshold-based approaches. Separation refinement tuning is also proposed to improve performance further. The proposed framework is trained and tested on a multichannel universal sound separation dataset developed in this work, designed to mimic realistic environments with moving sources and varying onsets and offsets of polyphonic events.
Submitted 18 September, 2024;
originally announced September 2024.