-
Wearable Tracking of Eye and Body Movements During Breaching Training: Towards Real-Time Blast Injury Monitoring
Authors:
Jeremy P. Kemmerer,
James R. Williamson,
Joseph Kim,
Elizabeth Halford,
Hrishikesh M. Rao,
Christopher J. Smalt
Abstract:
Repeated exposure to blast overpressure in occupational settings has been associated with changes in cognitive and psychological health, as well as deficits in neurosensory subsystems. In this work, we describe a wearable system to simultaneously monitor physiology and blast exposure levels and demonstrate how this system can identify individualized exposure levels corresponding to acute physiological response to blast exposure. Machine learning was used to develop a dose-response model that fused multiple physiological measures (electrooculography, gait, and balance) into a single risk score by predicting the level of blast exposure on held-out subjects (Fused model, R = 0.60). We found that blast events with peak pressure levels as low as 0.25 psi could be related to physiological changes and hence may contribute to blast injury. We also identified an individual subject with deteriorating reaction time scores that consistently showed a rapid and anomalous change in physiology-based risk scores after exposure to low-level blast events. Our results suggest that the wearable approach to blast monitoring is viable in weapons training environments as a complement to more direct but sparsely administered brain health assessments, potentially viable in austere environments, and that fusing multiple physiological signals can improve sensitivity.
Submitted 14 May, 2025;
originally announced May 2025.
-
Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge
Authors:
Chao-Han Huck Yang,
Sreyan Ghosh,
Qing Wang,
Jaeyeon Kim,
Hengyi Hong,
Sonal Kumar,
Guirui Zhong,
Zhifeng Kong,
S Sakshi,
Vaibhavi Lokegaonkar,
Oriol Nieto,
Ramani Duraiswami,
Dinesh Manocha,
Gunhee Kim,
Jun Du,
Rafael Valle,
Bryan Catanzaro
Abstract:
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact about the world effectively.
Submitted 12 May, 2025;
originally announced May 2025.
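The evaluation protocol named in the abstract above (top-1 accuracy with answer-shuffling robustness) can be illustrated with a minimal sketch. This is not the challenge's official scoring code: the function name `shuffled_top1_accuracy`, the `(question, options, answer)` item layout, the `answer_fn` callback, and the number of shuffles are all assumptions made for illustration.

```python
import random


def shuffled_top1_accuracy(items, answer_fn, n_shuffles=4, seed=0):
    """Top-1 accuracy averaged over several random orderings of the options.

    items:     list of (question, options, correct_option) tuples (hypothetical layout)
    answer_fn: callable (question, options) -> index of the chosen option
    """
    rng = random.Random(seed)  # fixed seed keeps the shuffles reproducible
    correct = 0
    total = 0
    for question, options, answer in items:
        for _ in range(n_shuffles):
            shuffled = list(options)
            rng.shuffle(shuffled)
            picked = answer_fn(question, shuffled)
            correct += int(shuffled[picked] == answer)
            total += 1
    return correct / total


# Toy data: a content-aware "model" stays correct under shuffling,
# while a model that always picks the first option does not benefit from position bias.
items = [
    ("Which animal is calling?", ["whale", "dolphin", "seal", "gull"], "dolphin"),
    ("Which sound comes first?", ["siren", "rain", "speech", "engine"], "rain"),
]
key = {question: answer for question, _, answer in items}
oracle = lambda question, options: options.index(key[question])
first_option = lambda question, options: 0

oracle_accuracy = shuffled_top1_accuracy(items, oracle)
biased_accuracy = shuffled_top1_accuracy(items, first_option)
```

Averaging over shuffled option orders penalizes a model that relies on answer position rather than on the audio content.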
-
SwinLSTM Autoencoder for Temporal-Spatial-Frequency Domain CSI Compression in Massive MIMO Systems
Authors:
Aakash Saini,
Yunchou Xing,
Jee Hyun Kim,
Amir Ahmadian Tehrani,
Wolfgang Gerstacker
Abstract:
This study presents a parameter-light, low-complexity artificial intelligence/machine learning (AI/ML) model that enhances channel state information (CSI) feedback in wireless systems by jointly exploiting temporal, spatial, and frequency (TSF) domain correlations. While traditional frameworks use autoencoders for CSI compression at the user equipment (UE) and reconstruction at the network (NW) side in the spatial-frequency (SF) domain, massive multiple-input multiple-output (mMIMO) systems in low-mobility scenarios exhibit strong temporal correlation alongside frequency and spatial correlations. An autoencoder architecture alone is insufficient to exploit the TSF domain correlation in CSI; a recurrent element is also required. To address the vanishing gradients problem, researchers in recent works have proposed state-of-the-art TSF domain CSI compression architectures that combine recurrent networks for temporal correlation exploitation with deep pre-trained autoencoders that handle SF domain CSI compression. However, this approach increases the number of parameters and computational complexity. To jointly utilize correlations across the TSF domain, we propose a novel, parameter-light, low-complexity AI/ML-based recurrent autoencoder architecture to compress CSI at the UE side and reconstruct it on the NW side while minimizing CSI feedback overhead.
Submitted 7 May, 2025;
originally announced May 2025.
-
MAISY: Motion-Aware Image SYnthesis for Medical Image Motion Correction
Authors:
Andrew Zhang,
Hao Wang,
Shuchang Ye,
Michael Fulham,
Jinman Kim
Abstract:
Patient motion during medical image acquisition causes blurring, ghosting, and organ distortion, which makes image interpretation challenging. Current state-of-the-art algorithms using Generative Adversarial Network (GAN)-based methods, with their ability to learn the mappings between corrupted images and their ground truth via Structural Similarity Index Measure (SSIM) loss, effectively generate motion-free images. However, we identified the following limitations: (i) they mainly focus on global structural characteristics and therefore overlook localized features that often carry critical pathological information, and (ii) the SSIM loss function struggles to handle images with varying pixel intensities, luminance factors, and variance. In this study, we propose Motion-Aware Image SYnthesis (MAISY), which initially characterizes motion and then uses it for correction by: (a) leveraging the foundation model Segment Anything Model (SAM) to dynamically learn spatial patterns along anatomical boundaries where motion artifacts are most pronounced, and (b) introducing the Variance-Selective SSIM (VS-SSIM) loss, which adaptively emphasizes spatial regions with high pixel variance to preserve essential anatomical details during artifact correction. Experiments on chest and head CT datasets demonstrate that our model outperformed the state-of-the-art counterparts, with Peak Signal-to-Noise Ratio (PSNR) increasing by 40%, SSIM by 10%, and Dice by 16%.
Submitted 8 May, 2025; v1 submitted 6 May, 2025;
originally announced May 2025.
-
Multi-Antenna Users in Cell-Free Massive MIMO: Stream Allocation and Necessity of Downlink Pilots
Authors:
Eren Berk Kama,
Junbeom Kim,
Emil Björnson
Abstract:
We consider a cell-free massive multiple-input multiple-output (MIMO) system with multiple antennas on the users and access points (APs). In previous works, the downlink spectral efficiency (SE) has been evaluated using the hardening bound that requires no downlink pilots. This approach works well for single-antenna users. In this paper, we show that much higher SEs can be achieved for multi-antenna users if downlink pilots are sent. The reason is that the effective channel matrix does not harden. We propose a pilot-based downlink estimation scheme, derive a new SE expression, and show numerically that it yields substantially higher performance under correlated Rayleigh fading channels.
In cases with multi-antenna users, the APs can either transmit the same or different data streams. The latter reduces the fronthaul signaling but comes with an SE loss. We propose precoding and combining schemes for these cases and consider whether channel knowledge is shared between the APs. Finally, we show numerically how the number of users, APs, and the number of antennas on users and APs affect the SE.
Submitted 5 May, 2025;
originally announced May 2025.
-
Stabilization by Controllers Having Integer Coefficients
Authors:
Joowon Lee,
Donggil Lee,
Junsoo Kim
Abstract:
The system property of ``having integer coefficients,'' that is, a transfer function has an integer monic polynomial as its denominator, is significant in the field of encrypted control as it is required for a dynamic controller to be realized over encrypted data. This paper shows that there always exists a controller with integer coefficients stabilizing a given discrete-time linear time-invariant plant. A constructive algorithm to obtain such a controller is provided, along with numerical examples. Furthermore, the proposed method is applied to converting a pre-designed controller to have integer coefficients, while the original performance is preserved in the sense that the transfer function of the closed-loop system remains unchanged.
Submitted 1 May, 2025;
originally announced May 2025.
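The property defined in the abstract above (the denominator of the transfer function is a monic polynomial with integer coefficients) can be illustrated with a small sketch. It is not the paper's constructive algorithm and says nothing about stability; the function names, the coefficient ordering (highest degree first), and the simplification that the numerator is z^n are assumptions made for illustration. The sketch only shows why the property matters: with integer coefficients and integer inputs, the output recursion never leaves the integers.

```python
from fractions import Fraction


def is_integer_monic(denominator):
    """True if the leading coefficient is 1 and every coefficient is an integer."""
    return denominator[0] == 1 and all(
        Fraction(c).denominator == 1 for c in denominator
    )


def simulate(denominator, inputs):
    """Run y[k] = u[k] - a1*y[k-1] - ... - an*y[k-n] for denominator [1, a1, ..., an].

    Assumes the numerator is z^n, so the input enters without delay.
    """
    feedback = denominator[1:]
    outputs = []
    for k, u in enumerate(inputs):
        y = u
        for i, a_i in enumerate(feedback, start=1):
            if k - i >= 0:
                y -= a_i * outputs[k - i]
        outputs.append(y)
    return outputs


integer_case = simulate([1, -3, 2], [1, 0, 0, 0])
stays_integer = all(isinstance(y, int) for y in integer_case)
```

Because every intermediate value is an integer, such a recursion needs only additions and multiplications over integers, which is the kind of arithmetic the abstract says is required to realize a dynamic controller over encrypted data.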
-
Efficient COLREGs-Compliant Collision Avoidance using Turning Circle-based Control Barrier Function
Authors:
Changyu Lee,
Jinwook Park,
Jinwhan Kim
Abstract:
This paper proposes a computationally efficient collision avoidance algorithm using turning circle-based control barrier functions (CBFs) that comply with international regulations for preventing collisions at sea (COLREGs). Conventional CBFs often lack explicit consideration of turning capabilities and avoidance direction, which are key elements in developing a COLREGs-compliant collision avoidance algorithm. To overcome these limitations, we introduce two CBFs derived from left and right turning circles. These functions establish safety conditions based on the proximity between the traffic ships and the centers of the turning circles, effectively determining both avoidance directions and turning capabilities. The proposed method formulates a quadratic programming problem with the CBFs as constraints, ensuring safe navigation without relying on computationally intensive trajectory optimization. This approach significantly reduces computational effort while maintaining performance comparable to model predictive control-based methods. Simulation results validate the effectiveness of the proposed algorithm in enabling COLREGs-compliant, safe navigation, demonstrating its potential for reliable and efficient operation in complex maritime environments.
Submitted 27 April, 2025;
originally announced April 2025.
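The idea in the abstract above of building one barrier function per turning circle can be sketched as follows. This is a geometric illustration only, not the paper's formulation: the barrier form h = distance(obstacle, circle center) - (turning radius + safety margin), the function names, and all parameter names are assumptions made for illustration.

```python
import math


def turning_circle_centers(x, y, heading, radius):
    """Centers of the left and right turning circles.

    heading is in radians, measured counterclockwise from the +x axis.
    """
    left = (x - radius * math.sin(heading), y + radius * math.cos(heading))
    right = (x + radius * math.sin(heading), y - radius * math.cos(heading))
    return left, right


def barrier_values(x, y, heading, radius, obstacle, safety_margin):
    """Candidate barrier values for the left and right turning circles.

    A non-negative value means the obstacle lies outside the corresponding
    turning circle inflated by the safety margin.
    """
    left, right = turning_circle_centers(x, y, heading, radius)
    clearance = radius + safety_margin
    h_left = math.dist(obstacle, left) - clearance
    h_right = math.dist(obstacle, right) - clearance
    return h_left, h_right


# Own ship at the origin heading along +x, obstacle on the port (left) side.
h_left, h_right = barrier_values(
    x=0.0, y=0.0, heading=0.0, radius=1.0, obstacle=(0.0, 3.0), safety_margin=0.5
)
```

In this toy case the right-turn value is larger than the left-turn value, reflecting that an obstacle on the left leaves more room for a starboard turn. In the approach described by the abstract, functions of this kind enter a quadratic program as constraints.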
-
Documentation on Encrypted Dynamic Control Simulation Code using Ring-LWE based Cryptosystems
Authors:
Yeongjun Jang,
Joowon Lee,
Junsoo Kim
Abstract:
Encrypted controllers offer secure computation by employing modern cryptosystems to execute control operations directly over encrypted data without decryption. However, incorporating cryptosystems into dynamic controllers significantly increases the computational load. This paper aims to provide an accessible guideline for running encrypted controllers using the open-source library Lattigo, which supports an efficient implementation of Ring-Learning With Errors (Ring-LWE) based encrypted controllers. Our explanations are assisted with example codes that are fully available at https://github.com/CDSL-EncryptedControl/CDSL.
Submitted 17 April, 2025;
originally announced April 2025.
-
OmniMamba4D: Spatio-temporal Mamba for longitudinal CT lesion segmentation
Authors:
Justin Namuk Kim,
Yiqiao Liu,
Rajath Soans,
Keith Persson,
Sarah Halek,
Michal Tomaszewski,
Jianda Yuan,
Gregory Goldmacher,
Antong Chen
Abstract:
Accurate segmentation of longitudinal CT scans is important for monitoring tumor progression and evaluating treatment responses. However, existing 3D segmentation models solely focus on spatial information. To address this gap, we propose OmniMamba4D, a novel segmentation model designed for 4D medical images (3D images over time). OmniMamba4D utilizes a spatio-temporal tetra-orientated Mamba block to effectively capture both spatial and temporal features. Unlike traditional 3D models, which analyze single time points, OmniMamba4D processes 4D CT data, providing comprehensive spatio-temporal information on lesion progression. Evaluated on an internal dataset comprising 3,252 CT scans, OmniMamba4D achieves a competitive Dice score of 0.682, comparable to state-of-the-art (SOTA) models, while maintaining computational efficiency and better detecting disappeared lesions. This work demonstrates a new framework to leverage spatio-temporal information for longitudinal CT lesion segmentation.
Submitted 24 April, 2025; v1 submitted 13 April, 2025;
originally announced April 2025.
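The Dice score reported in the abstract above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted and a reference mask. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty (for example, a lesion that has disappeared and is correctly predicted as absent) are assumptions for illustration, not necessarily what the authors used.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


half_overlap = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
both_empty = dice_score([0, 0, 0], [0, 0, 0])
```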
-
Asymptotic stabilization under homomorphic encryption: A re-encryption free method
Authors:
Shuai Feng,
Qian Ma,
Junsoo Kim,
Shengyuan Xu
Abstract:
In this paper, we propose methods to encrypt a pre-given dynamic controller with homomorphic encryption, without re-encrypting the control inputs. We first present a preliminary result showing that the coefficients in a pre-given dynamic controller can be scaled up into integers by the zooming-in factor in dynamic quantization, without utilizing re-encryption. However, a sufficiently small zooming-in factor may not always exist because it requires that the convergence speed of the pre-given closed-loop system be sufficiently fast. Then, as the main result, we design a new controller approximating the pre-given dynamic controller, in which the zooming-in factor is decoupled from the convergence rate of the pre-given closed-loop system. Therefore, there always exists a (sufficiently small) zooming-in factor of dynamic quantization scaling up all the controller's coefficients to integers, and a finite modulus preventing overflow in cryptosystems. The process is asymptotically stable and the quantizer is not saturated.
Submitted 12 April, 2025;
originally announced April 2025.
-
System Identification from Partial Observations under Adversarial Attacks
Authors:
Jihun Kim,
Javad Lavaei
Abstract:
This paper is concerned with the partially observed linear system identification, where the goal is to obtain reasonably accurate estimation of the balanced truncation of the true system up to the order $k$ from output measurements. We consider the challenging case of system identification under adversarial attacks, where the probability of having an attack at each time is $Θ(1/k)$ while the value of the attack is arbitrary. We first show that the $l_1$-norm estimator exactly identifies the true Markov parameter matrix for nilpotent systems under any type of attack. We then build on this result to extend it to general systems and show that the estimation error exponentially decays as $k$ grows. The estimated balanced truncation model accordingly shows an exponentially decaying error for the identification of the true system up to the similarity transformation. This work is the first to provide the input-output analysis of the system with partial observations under arbitrary attacks.
Submitted 31 March, 2025;
originally announced April 2025.
-
Nonhuman Primate Brain Tissue Segmentation Using a Transfer Learning Approach
Authors:
Zhen Lin,
Hongyu Yuan,
Richard Barcus,
Qing Lyu,
Sucheta Chakravarty,
Megan E. Lipford,
Carol A. Shively,
Suzanne Craft,
Mohammad Kawas,
Jeongchul Kim,
Christopher T. Whitlow
Abstract:
Non-human primates (NHPs) serve as critical models for understanding human brain function and neurological disorders due to their close evolutionary relationship with humans. Accurate brain tissue segmentation in NHPs is critical for understanding neurological disorders, but challenging due to the scarcity of annotated NHP brain MRI datasets, the small size of the NHP brain, the limited resolution of available imaging data and the anatomical differences between human and NHP brains. To address these challenges, we propose a novel approach utilizing STU-Net with transfer learning to leverage knowledge transferred from human brain MRI data to enhance segmentation accuracy in the NHP brain MRI, particularly when training data is limited. The combination of STU-Net and transfer learning effectively delineates complex tissue boundaries and captures fine anatomical details specific to NHP brains. Notably, our method demonstrated improvement in segmenting small subcortical structures such as putamen and thalamus that are challenging to resolve with limited spatial resolution and tissue contrast, and achieved DSC of over 0.88, IoU over 0.8 and HD95 under 7. This study introduces a robust method for multi-class brain tissue segmentation in NHPs, potentially accelerating research in evolutionary neuroscience and preclinical studies of neurological disorders relevant to human health.
Submitted 1 April, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Turning Circle-based Control Barrier Function for Efficient Collision Avoidance of Nonholonomic Vehicles
Authors:
Changyu Lee,
Kiyong Park,
Jinwhan Kim
Abstract:
This paper presents a new control barrier function (CBF) designed to improve the efficiency of collision avoidance for nonholonomic vehicles. Traditional CBFs typically rely on the shortest Euclidean distance to obstacles, overlooking the limited heading change ability of nonholonomic vehicles. This often leads to abrupt maneuvers and excessive speed reductions, which is not desirable and reduces the efficiency of collision avoidance. Our approach addresses these limitations by incorporating the distance to the turning circle, considering the vehicle's limited maneuverability imposed by its nonholonomic constraints. The proposed CBF is integrated with model predictive control (MPC) to generate more efficient trajectories compared to existing methods that rely solely on Euclidean distance-based CBFs. The effectiveness of the proposed method is validated through numerical simulations on unicycle vehicles and experiments with underactuated surface vehicles.
Submitted 26 March, 2025;
originally announced March 2025.
-
Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration
Authors:
Taejin Jeong,
Joohyeok Kim,
Jaehoon Joo,
Yeonwoo Jung,
Hyeonmin Kim,
Seong Jae Hwang
Abstract:
Glaucoma is an incurable ophthalmic disease that damages the optic nerve, leads to vision loss, and ranks among the leading causes of blindness worldwide. Diagnosing glaucoma typically involves fundus photography, optical coherence tomography (OCT), and visual field testing. However, the high cost of OCT often leads to reliance on fundus photography and visual field testing, both of which exhibit inherent inter-observer variability. This stems from glaucoma being a multifaceted disease that is influenced by various factors. As a result, glaucoma diagnosis is highly subjective, emphasizing the necessity of calibration, which aligns predicted probabilities with actual disease likelihood. Proper calibration is essential to prevent overdiagnosis or misdiagnosis, which are critical concerns for high-risk diseases. Although AI has significantly improved diagnostic accuracy, overconfidence in models has worsened calibration performance. Recent studies have begun focusing on calibration for glaucoma. Nevertheless, previous studies have not fully considered glaucoma's systemic nature and the high subjectivity in its diagnostic process. To overcome these limitations, we propose V-ViT (Voting-based ViT), a novel framework that enhances calibration by incorporating disease-specific characteristics. V-ViT integrates binocular data and metadata, reflecting the multifaceted nature of glaucoma diagnosis. Additionally, we introduce an MC dropout-based Voting System to address high subjectivity. Our approach achieves state-of-the-art performance across all metrics, including accuracy, demonstrating that our proposed methods are effective in addressing calibration issues. We validate our method using a custom dataset including binocular data.
Submitted 24 March, 2025;
originally announced March 2025.
-
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Authors:
Ji-Hoon Kim,
Jeongsoo Choi,
Jaehun Kim,
Chaeyoung Jung,
Joon Son Chung
Abstract:
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure a seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
Submitted 21 March, 2025;
originally announced March 2025.
-
PCLA: A Framework for Testing Autonomous Agents in the CARLA Simulator
Authors:
Masoud Jamshidiyan Tehrani,
Jinhan Kim,
Paolo Tonella
Abstract:
Recent research on testing autonomous driving agents has grown significantly, especially in simulation environments. The CARLA simulator is often the preferred choice, and the autonomous agents from the CARLA Leaderboard challenge are regarded as the best-performing agents within this environment. However, researchers who test these agents, rather than training their own from scratch, often face challenges in utilizing them within customized test environments and scenarios. To address these challenges, we introduce PCLA (Pretrained CARLA Leaderboard Agents), an open-source Python testing framework that includes nine high-performing pre-trained autonomous agents from the Leaderboard challenges. PCLA is the first infrastructure specifically designed for testing various autonomous agents in arbitrary CARLA environments/scenarios. PCLA provides a simple way to deploy Leaderboard agents onto a vehicle without relying on the Leaderboard codebase. It allows researchers to easily switch between agents without requiring modifications to CARLA versions or programming environments, and it is fully compatible with the latest version of CARLA while remaining independent of the Leaderboard's specific CARLA version. PCLA is publicly accessible at https://github.com/MasoudJTehrani/PCLA.
Submitted 13 March, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Merry-Go-Round: Safe Control of Decentralized Multi-Robot Systems with Deadlock Prevention
Authors:
Wonjong Lee,
Joonyeol Sim,
Joonkyung Kim,
Siwon Jo,
Wenhao Luo,
Changjoo Nam
Abstract:
We propose a hybrid approach for decentralized multi-robot navigation that ensures both safety and deadlock prevention. Building on a standard control formulation, we add a lightweight deadlock prevention mechanism by forming temporary "roundabouts" (circular reference paths). Each robot relies only on local, peer-to-peer communication and a controller for base collision avoidance; a roundabout is generated or joined on demand to avert deadlocks. Robots in the roundabout travel in one direction until an escape condition is met, allowing them to return to goal-oriented motion. Unlike classical decentralized methods that lack explicit deadlock resolution, our roundabout maneuver ensures system-wide forward progress while preserving safety constraints. Extensive simulations and physical robot experiments show that our method consistently outperforms or matches the success and arrival rates of other decentralized control approaches, particularly in cluttered or high-density scenarios, all with minimal centralized coordination.
Submitted 7 March, 2025;
originally announced March 2025.
-
A Risk-aware Bi-level Bidding Strategy for Virtual Power Plant with Power-to-Hydrogen System
Authors:
Jaehyun Yoo,
Jip Kim
Abstract:
This paper presents a risk-aware bi-level bidding strategy for a Virtual Power Plant (VPP) that integrates a Power-to-Hydrogen (P2H) system, addressing the challenges posed by renewable energy variability and market volatility. By incorporating Conditional Value at Risk (CVaR) within the bi-level optimization framework, the proposed strategy enables VPPs to mitigate financial risks associated with uncertain market conditions. The upper-level problem seeks to maximize revenue through optimal bidding, while the lower-level problem ensures market-clearing compliance. The integration of the P2H system allows surplus renewable energy to be stored as hydrogen, which is utilized as an energy carrier, thereby increasing market profitability and enhancing resilience against financial risks. The effectiveness of the proposed strategy is validated through a modified IEEE 14 bus system, demonstrating that the inclusion of the P2H system and CVaR-based risk aversion enhances both revenue and financial hedging capability under volatile market conditions. This paper underscores the strategic role of hydrogen storage in VPP operations, supporting improved profitability and the efficacy of a risk-aware bidding strategy.
Submitted 7 March, 2025;
originally announced March 2025.
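The Conditional Value at Risk (CVaR) measure named in the abstract above can be illustrated with a small sample-based estimate. This is only a sketch under stated assumptions: it uses the common empirical estimator (the mean of the worst (1 - alpha) fraction of sampled losses), not the paper's bi-level formulation, and the function name `cvar_of_losses` and the sample values are hypothetical.

```python
import math


def cvar_of_losses(losses, alpha=0.95):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of losses.

    Assumption: larger values are worse (losses, not profits).
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(losses, reverse=True)  # worst losses first
    # Round before ceil to avoid floating-point noise such as 1.0000000001.
    tail_count = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_count]
    return sum(tail) / len(tail)


# Hypothetical scenario losses for four market outcomes.
scenario_losses = [1.0, 2.0, 3.0, 4.0]
worst_half_average = cvar_of_losses(scenario_losses, alpha=0.5)
```

With `alpha=0.5` the estimate averages the two largest losses; with `alpha=0.0` it reduces to the plain mean, which is why CVaR is often described as a tail-focused generalization of expected loss.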
-
Community Energy Management System for Fast Frequency Response: A Hierarchical Control Approach
Authors:
Joonsung Jung,
Hyunjoong Kim,
Hyunghwan Shin,
Jip Kim
Abstract:
The increase in renewable energy sources (RES) has reduced power system inertia, making frequency stabilization more challenging and highlighting the need for fast frequency response (FFR) resources. While building energy management systems (BEMS) equipped with distributed energy resources (DERs) can provide FFR, individual BEMS alone cannot fully meet demand. To address this, we propose a community energy management system (CEMS) operational model that minimizes energy costs and generates additional revenue by providing FFR through coordinated DERs and building loads under photovoltaic (PV) generation uncertainty. The model incorporates a hierarchical control framework with three levels: Level 1 allocates maximum FFR capacity, Level 2 employs scenario-based stochastic model predictive control (SMPC) to adjust DER operations and ensure FFR provision despite PV uncertainties, and Level 3 performs rapid load adjustments in response to frequency fluctuations detected by a frequency meter. Simulation results on a campus building cluster demonstrate the effectiveness of the proposed model, achieving a 10% reduction in energy costs and a 24% increase in FFR capacity, all while maintaining occupant comfort and enhancing frequency stabilization.
Submitted 7 March, 2025;
originally announced March 2025.
-
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Authors:
Sreyan Ghosh,
Zhifeng Kong,
Sonal Kumar,
S Sakshi,
Jaehyeon Kim,
Wei Ping,
Rafael Valle,
Dinesh Manocha,
Bryan Catanzaro
Abstract:
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.
Submitted 5 March, 2025;
originally announced March 2025.
-
Deep learning approaches to surgical video segmentation and object detection: A Scoping Review
Authors:
Devanish N. Kamtam,
Joseph B. Shrager,
Satya Deepya Malla,
Nicole Lin,
Juan J. Cardona,
Jake J. Kim,
Clarence Hu
Abstract:
Introduction: Computer vision (CV) has had a transformative impact in biomedical fields such as radiology, dermatology, and pathology. Its real-world adoption in surgical applications, however, remains limited. We review the current state-of-the-art performance of deep learning (DL)-based CV models for segmentation and object detection of anatomical structures in videos obtained during surgical procedures.
Methods: We conducted a scoping review of studies on semantic segmentation and object detection of anatomical structures published between 2014 and 2024 from 3 major databases - PubMed, Embase, and IEEE Xplore. The primary objective was to evaluate the state-of-the-art performance of semantic segmentation in surgical videos. Secondary objectives included examining DL models, progress toward clinical applications, and the specific challenges with segmentation of organs/tissues in surgical videos.
Results: We identified 58 relevant published studies. These focused predominantly on procedures from general surgery [20(34.4%)], colorectal surgery [9(15.5%)], and neurosurgery [8(13.8%)]. Cholecystectomy [14(24.1%)] and low anterior rectal resection [5(8.6%)] were the most common procedures addressed. Semantic segmentation [47(81%)] was the primary CV task. U-Net [14(24.1%)] and DeepLab [13(22.4%)] were the most widely used models. Larger organs such as the liver (Dice score: 0.88) had higher accuracy compared to smaller structures such as nerves (Dice score: 0.49). Models demonstrated real-time inference potential ranging from 5-298 frames-per-second (fps).
Conclusion: This review highlights the significant progress made in DL-based semantic segmentation for surgical videos with real-time applicability, particularly for larger organs. Addressing challenges with smaller structures, data availability, and generalizability remains crucial for future advancements.
Submitted 23 February, 2025;
originally announced February 2025.
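The Dice scores quoted in the review above (for example 0.88 for liver and 0.49 for nerves) measure overlap between a predicted mask and a reference mask. A minimal sketch of the standard Dice coefficient on flattened binary masks follows; the function name `dice_score`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the reviewed studies.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy 4-pixel masks: one pixel overlaps, each mask has two foreground pixels.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 1, 0]
overlap = dice_score(predicted_mask, reference_mask)
```

Because the denominator is the total foreground size, a fixed number of misclassified pixels costs far more Dice on a small structure than on a large one, which is consistent with the gap the review reports between large organs and small structures.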
-
Structure-from-Sherds++: Robust Incremental 3D Reassembly of Axially Symmetric Pots from Unordered and Mixed Fragment Collections
Authors:
Seong Jong Yoo,
Sisung Liu,
Muhammad Zeeshan Arshad,
Jinhyeok Kim,
Young Min Kim,
Yiannis Aloimonos,
Cornelia Fermuller,
Kyungdon Joo,
Jinwook Kim,
Je Hyeong Hong
Abstract:
Reassembling multiple axially symmetric pots from fragmentary sherds is crucial for cultural heritage preservation, yet it poses significant challenges due to thin and sharp fracture surfaces that generate numerous false positive matches and hinder large-scale puzzle solving. Existing global approaches, which optimize all potential fragment pairs simultaneously, and data-driven models are prone to local minima and face scalability issues when multiple pots are intermixed. Motivated by Structure-from-Motion (SfM) for 3D reconstruction from multiple images, we propose an efficient reassembly method for axially symmetric pots based on iterative registration of one sherd at a time, called Structure-from-Sherds++ (SfS++). Our method extends beyond simple replication of incremental SfM and leverages multi-graph beam search to explore multiple registration paths. This allows us to effectively filter out indistinguishable false matches and simultaneously reconstruct multiple pots without requiring prior information such as the base or the number of mixed objects. Our approach achieves 87% reassembly accuracy on a dataset of 142 real fragments from 10 different pots, outperforming other methods in handling complex fracture patterns with mixed datasets and achieving state-of-the-art performance. Code and results can be found on our project page: https://sj-yoo.info/sfs/.
Submitted 18 February, 2025;
originally announced February 2025.
-
Anomaly Detection with LWE Encrypted Control
Authors:
Rijad Alisic,
Junsoo Kim,
Henrik Sandberg
Abstract:
Detecting attacks using encrypted signals is challenging since encryption hides their information content. We present a novel mechanism for anomaly detection over Learning with Errors (LWE) encrypted signals without using decryption, secure channels, or complex communication schemes. Instead, the detector exploits the homomorphic property of LWE encryption to perform hypothesis tests on transformations of the encrypted samples. The specific transformations are determined by solutions to a hard lattice-based minimization problem. While the test's sensitivity deteriorates with suboptimal solutions, similar to the exponential deterioration of the (related) test that breaks the cryptosystem, we show that the deterioration is polynomial for our test. This rate gap can be exploited to pick parameters that lead to somewhat weaker encryption but large gains in detection capability. Finally, we conclude the paper by presenting a numerical example that simulates anomaly detection, demonstrating the effectiveness of our method in identifying attacks.
Submitted 14 February, 2025;
originally announced February 2025.
-
Rate-Splitting Multiple Access for 6G: Prototypes, Experimental Results and Link/System level Simulations
Authors:
Sundar Aditya,
Yong Jin Daniel Kim,
David Vargas,
David Redgate,
Onur Dizdar,
Neil Bhushan,
Xinze Lyu,
Sibo Zhang,
Stephen Wang,
Bruno Clerckx
Abstract:
Rate-Splitting Multiple Access (RSMA) is a powerful and versatile physical layer multiple access technique that generalizes and has better interference management capabilities than 5G-based Space Division Multiple Access (SDMA). It is also a rapidly maturing technology, all of which makes it a natural successor to SDMA in 6G. In this article, we describe RSMA's suitability for 6G by presenting: (i) link and system level simulations of RSMA's performance gains over SDMA in realistic environments, and (ii) pioneering experimental results that demonstrate RSMA's gains over SDMA for key use cases like enhanced Mobile Broadband (eMBB) and Integrated Sensing and Communications (ISAC). We also comment on the status of standardization activities for RSMA.
Submitted 17 February, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
Improving Lesion Segmentation in Medical Images by Global and Regional Feature Compensation
Authors:
Chuhan Wang,
Zhenghao Chen,
Jean Y. H. Yang,
Jinman Kim
Abstract:
Automated lesion segmentation of medical images has made tremendous improvements in recent years due to deep learning advancements. However, accurately capturing fine-grained global and regional feature representations remains a challenge. Many existing methods obtain suboptimal performance on complex lesion segmentation due to information loss during typical downsampling operations and the insufficient capture of either regional or global features. To address these issues, we propose the Global and Regional Compensation Segmentation Framework (GRCSF), which introduces two key innovations: the Global Compensation Unit (GCU) and the Region Compensation Unit (RCU). The proposed GCU addresses resolution loss in the U-shaped backbone by preserving global contextual features and fine-grained details during multiscale downsampling. Meanwhile, the RCU introduces a self-supervised learning (SSL) residual map generated by Masked Autoencoders (MAE), obtained as pixel-wise differences between reconstructed and original images, to highlight regions with potential lesions. These SSL residual maps guide precise lesion localization and segmentation through a patch-based cross-attention mechanism that integrates regional spatial and pixel-level features. Additionally, the RCU incorporates patch-level importance scoring to enhance feature fusion by leveraging global spatial information from the backbone. Experiments on two publicly available medical image segmentation datasets, including brain stroke lesion and coronary artery calcification datasets, demonstrate that our GRCSF outperforms state-of-the-art methods, confirming its effectiveness across diverse lesion types and its potential as a generalizable lesion segmentation solution.
Submitted 11 February, 2025;
originally announced February 2025.
-
Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge
Authors:
Muhammad Imran,
Jonathan R. Krebs,
Vishal Balaji Sivaraman,
Teng Zhang,
Amarjeet Kumar,
Walker R. Ueland,
Michael J. Fassler,
Jinlong Huang,
Xiao Sun,
Lisheng Wang,
Pengcheng Shi,
Maximilian Rokuss,
Michael Baumgartner,
Yannick Kirchhof,
Klaus H. Maier-Hein,
Fabian Isensee,
Shuolin Liu,
Bing Han,
Bong Thanh Nguyen,
Dong-jin Shin,
Park Ji-Woo,
Mathew Choi,
Kwang-Hyun Uhm,
Sung-Jea Ko,
Chanwoong Lee
, et al. (38 additional authors not shown)
Abstract:
Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently available to support the development of multi-class aortic segmentation methods. To address this gap, we organized the AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes annotated for 23 clinically relevant aortic branches and zones. This dataset was designed to facilitate both model development and validation. The challenge attracted 121 teams worldwide, with participants leveraging state-of-the-art frameworks such as nnU-Net and exploring novel techniques, including cascaded models, data augmentation strategies, and custom loss functions. We evaluated the submitted algorithms using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD), highlighting the approaches adopted by the top five performing teams. This paper presents the challenge design, dataset details, evaluation metrics, and an in-depth analysis of the top-performing algorithms. The annotated dataset, evaluation code, and implementations of the leading methods are publicly available to support further research. All resources can be accessed at https://aortaseg24.grand-challenge.org.
Submitted 7 February, 2025;
originally announced February 2025.
-
Enhancing Free-hand 3D Photoacoustic and Ultrasound Reconstruction using Deep Learning
Authors:
SiYeoul Lee,
SeonHo Kim,
Minkyung Seo,
SeongKyu Park,
Salehin Imrus,
Kambaluru Ashok,
DongEon Lee,
Chunsu Park,
SeonYeong Lee,
Jiye Kim,
Jae-Heung Yoo,
MinWoo Kim
Abstract:
This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconstruction, faces significant challenges in accurate motion estimation without relying on external positional sensors. MoGLo-Net addresses these limitations through an innovative adaptation of the self-attention mechanism, which effectively exploits critical regions, such as fully developed speckle areas or high-echogenic tissue areas within successive ultrasound images, to accurately estimate motion parameters. This facilitates the extraction of intricate features from individual frames. Additionally, we designed a patch-wise correlation operation to generate a correlation volume that is highly correlated with the scanning motion. A custom loss function was also developed to ensure robust learning with minimized bias, leveraging the characteristics of the motion parameters. Experimental evaluations demonstrated that MoGLo-Net surpasses current state-of-the-art methods in both quantitative and qualitative performance metrics. Furthermore, we expanded the application of 3D reconstruction technology beyond simple B-mode ultrasound volumes to incorporate Doppler ultrasound and photoacoustic imaging, enabling 3D visualization of vasculature. The source code for this study is publicly available at: https://github.com/guhong3648/US3D
Submitted 5 February, 2025;
originally announced February 2025.
-
Enhancing Feature Tracking Reliability for Visual Navigation using Real-Time Safety Filter
Authors:
Dabin Kim,
Inkyu Jang,
Youngsoo Han,
Sunwoo Hwang,
H. Jin Kim
Abstract:
Vision sensors are extensively used for localizing a robot's pose, particularly in environments where global localization tools such as GPS or motion capture systems are unavailable. In many visual navigation systems, localization is achieved by detecting and tracking visual features or landmarks, which provide information about the sensor's relative pose. For reliable feature tracking and accurate pose estimation, it is crucial to maintain visibility of a sufficient number of features. This requirement can sometimes conflict with the robot's overall task objective. In this paper, we approach it as a constrained control problem. By leveraging the invariance properties of visibility constraints within the robot's kinematic model, we propose a real-time safety filter based on quadratic programming. This filter takes a reference velocity command as input and produces a modified velocity that minimally deviates from the reference while ensuring the information score from the currently visible features remains above a user-specified threshold. Numerical simulations demonstrate that the proposed safety filter preserves the invariance condition and ensures the visibility of more features than the required minimum. We also validated its real-world performance by integrating it into a visual simultaneous localization and mapping (SLAM) algorithm, where it maintained high estimation quality in challenging environments, outperforming a simple tracking controller.
Submitted 3 February, 2025;
originally announced February 2025.
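The safety filter described in the abstract above returns the velocity closest to a reference command subject to a constraint. When such a quadratic program has a single affine constraint, it has a closed-form solution (a projection onto a half-space), which the sketch below illustrates. This is an assumption-laden toy, not the paper's filter: the constraint form a·u ≥ b, the names `filter_velocity`, `a`, and `b`, and the numeric values are all hypothetical.

```python
def filter_velocity(u_ref, a, b):
    """Solve min ||u - u_ref||^2 subject to a . u >= b in closed form.

    u_ref: reference velocity command (list of floats).
    a, b:  hypothetical affine constraint standing in for a condition that
           keeps the feature information score above a threshold.
    """
    if len(u_ref) != len(a):
        raise ValueError("u_ref and a must have the same dimension")
    margin = sum(ai * ui for ai, ui in zip(a, u_ref)) - b
    if margin >= 0.0:
        # Reference command already satisfies the constraint: pass it through.
        return list(u_ref)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint is infeasible (a is zero and b > 0)")
    # Move the minimum distance along a needed to reach the boundary a . u = b.
    scale = -margin / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_ref, a)]


# The reference command violates the constraint u[1] >= 0.5 and is corrected.
safe_command = filter_velocity([1.0, 0.0], a=[0.0, 1.0], b=0.5)
```

The filter leaves the command untouched whenever the constraint already holds, which is the "minimally deviates from the reference" behavior the abstract describes; with several constraints a general QP solver would be needed instead of this closed form.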
-
Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective
Authors:
Yujin Oh,
Pengfei Jin,
Sangjoon Park,
Sekeun Kim,
Siyeop Yoon,
Kyungsang Kim,
Jin Sung Kim,
Xiang Li,
Quanzheng Li
Abstract:
Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available.
Submitted 1 February, 2025;
originally announced February 2025.
-
Noise disturbance and lack of privacy: Modeling acoustic dissatisfaction in open-plan offices
Authors:
Manuj Yadav,
Jungsoo Kim,
Valtteri Hongisto,
Densil Cabrera,
Richard de Dear
Abstract:
Open-plan offices are well known to be adversely affected by acoustic issues. This study aims to model acoustic dissatisfaction using measurements of room acoustics, sound environment during occupancy, and occupant surveys (n = 349) in 28 offices representing a diverse range of workplace parameters. As latent factors, the contribution of $\textit{lack of privacy}$ (LackPriv) was 25% higher than $\textit{noise disturbance}$ (NseDstrb) in predicting $\textit{acoustic dissatisfaction}$ (AcDsat). Room acoustic metrics based on sound pressure level (SPL) decay of speech ($L_{\text{p,A,s,4m}}$ and $r_{\text{C}}$) were better in predicting these factors than distraction distance ($r_{\text{D}}$) based on speech transmission index. This contradicts previous findings, and the trends for SPL-based metrics in predicting AcDsat and LackPriv go against expectations based on ISO 3382-3. For sound during occupation, $L_{\text{A,90}}$ and psychoacoustic loudness ($N_{\text{90}}$) predicted AcDsat, and an SPL fluctuation metric ($M_{\text{A,eq}}$) predicted LackPriv. However, these metrics were weaker predictors than ISO 3382-3 metrics. Medium-sized offices exhibited higher dissatisfaction than larger ($\geq$50 occupants) offices. Dissatisfaction varied substantially across parameters including ceiling heights, number of workstations, and years of work, but not between offices with fixed seating compared to more flexible and activity-based working configurations. Overall, these findings highlight the complexities in characterizing occupants' perceptions using instrumental acoustic measurements.
Submitted 3 May, 2025; v1 submitted 26 January, 2025;
originally announced January 2025.
-
Noise-Agnostic Multitask Whisper Training for Reducing False Alarm Errors in Call-for-Help Detection
Authors:
Myeonghoon Ryu,
June-Woo Kim,
Minseok Oh,
Suji Lee,
Han Park
Abstract:
Keyword spotting is often implemented by attaching a keyword classifier to the encoder of an acoustic model, enabling the classification of predefined or open vocabulary keywords. Although keyword spotting is a crucial task in various applications and can be extended to call-for-help detection in emergencies, previous methods often suffer from scalability limitations because retraining is required to introduce new keywords or adapt to changing contexts. We explore a simple yet effective approach that leverages off-the-shelf pretrained ASR models to address these challenges, especially in call-for-help detection scenarios. Furthermore, we observed a substantial increase in false alarms when deploying a call-for-help detection system in real-world scenarios due to noise introduced by microphones or different environments. To address this, we propose a novel noise-agnostic multitask learning approach that integrates a noise classification head into the ASR encoder. Our method enhances the model's robustness to noisy environments, leading to a significant reduction in false alarms and improved overall call-for-help performance. Despite the added complexity of multitask learning, our approach is computationally efficient and provides a promising solution for call-for-help detection in real-world scenarios.
Submitted 20 January, 2025;
originally announced January 2025.
-
DLinear-based Prediction of Remaining Useful Life of Lithium-Ion Batteries: Feature Engineering through Explainable Artificial Intelligence
Authors:
Minsu Kim,
Jaehyun Oh,
Sang-Young Lee,
Junghwan Kim
Abstract:
Accurate prediction of the Remaining Useful Life (RUL) of lithium-ion batteries is essential for ensuring safety, reducing maintenance costs, and optimizing usage. However, predicting RUL is challenging due to the nonlinear characteristics of the degradation caused by complex chemical reactions. Machine learning allows precise predictions by learning the latent functions of degradation relationships based on cycling behavior. This study introduces an accurate RUL prediction approach based on feature engineering and DLinear, applied to the dataset from NASA's Prognostics Center of Excellence. Among the 20 features generated from current, voltage, temperature, and time provided in this dataset, key features contributing to degradation are selected using the Pearson correlation coefficient and Shapley values. Shapley value-based feature selection effectively reflects cell-to-cell variability, showing similar importance rankings across all cells. The DLinear-based RUL prediction using key features efficiently captures the time-series trend, demonstrating significantly better performance compared to Long Short-Term Memory and Transformer models.
Submitted 20 January, 2025;
originally announced January 2025.
-
Single-Channel Distance-Based Source Separation for Mobile GPU in Outdoor and Indoor Environments
Authors:
Hanbin Bae,
Byungjun Kang,
Jiwon Kim,
Jaeyong Hwang,
Hosang Sung,
Hoon-Young Cho
Abstract:
This study emphasizes the significance of exploring distance-based source separation (DSS) in outdoor environments. Unlike existing studies that primarily focus on indoor settings, the proposed model is designed to capture the unique characteristics of outdoor audio sources. It incorporates advanced techniques, including a two-stage conformer block, a linear relation-aware self-attention (RSA), and a TensorFlow Lite GPU delegate. While the linear RSA may not capture physical cues as explicitly as the quadratic RSA, the linear RSA enhances the model's context awareness, leading to improved performance on the DSS that requires an understanding of physical cues in outdoor and indoor environments. The experimental results demonstrated that the proposed model overcomes the limitations of existing approaches and considerably enhances energy efficiency and real-time inference speed on mobile devices.
Submitted 6 January, 2025;
originally announced January 2025.
-
SNeRV: Spectra-preserving Neural Representation for Video
Authors:
Jina Kim,
Jihoo Lee,
Je-Won Kang
Abstract:
Neural representation for video (NeRV), which employs a neural network to parameterize video signals, introduces a novel methodology in video representations. However, existing NeRV-based methods have difficulty in capturing fine spatial details and motion patterns due to spectral bias, in which a neural network learns high-frequency (HF) components at a slower rate than low-frequency (LF) components. In this paper, we propose spectra-preserving NeRV (SNeRV) as a novel approach to enhance implicit video representations by efficiently handling various frequency components. SNeRV uses 2D discrete wavelet transform (DWT) to decompose video into LF and HF features, preserving spatial structures and directly addressing the spectral bias issue. To balance the compactness, we encode only the LF components, while HF components that include fine textures are generated by a decoder. Specialized modules, including a multi-resolution fusion unit (MFU) and a high-frequency restorer (HFR), are integrated into a backbone to facilitate the representation. Furthermore, we extend SNeRV to effectively capture temporal correlations between adjacent video frames by casting the extension as an additional frequency decomposition in the temporal domain. This approach allows us to embed spatio-temporal LF features into the network, using temporally extended up-sampling blocks (TUBs). Experimental results demonstrate that SNeRV outperforms existing NeRV models in capturing fine details and achieves enhanced reconstruction, making it a promising approach in the field of implicit video representations. The codes are available at https://github.com/qwertja/SNeRV.
Submitted 3 January, 2025;
originally announced January 2025.
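The wavelet split described in the abstract above can be illustrated with a one-level 2D Haar transform, the simplest DWT. This is only a sketch under stated assumptions: the abstract does not say which wavelet SNeRV uses, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical, not taken from the authors' code. It shows how a frame separates into one low-frequency band (`ll`) and three high-frequency detail bands, and that the four bands together reconstruct the frame exactly.

```python
def haar_dwt2(frame):
    """One-level orthonormal 2D Haar DWT of a frame (list of rows).

    Assumes even height and width. Returns four half-size sub-bands:
    ll (low-frequency average) and three high-frequency detail bands.
    """
    h, w = len(frame), len(frame[0])
    assert h % 2 == 0 and w % 2 == 0, "frame dimensions must be even"
    ll, d1, d2, d3 = [], [], [], []
    for i in range(0, h, 2):
        r_ll, r_d1, r_d2, r_d3 = [], [], [], []
        for j in range(0, w, 2):
            a, b = frame[i][j], frame[i][j + 1]
            c, d = frame[i + 1][j], frame[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # local average (LF)
            r_d1.append((a - b + c - d) / 2)  # difference across columns
            r_d2.append((a + b - c - d) / 2)  # difference across rows
            r_d3.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(r_ll)
        d1.append(r_d1)
        d2.append(r_d2)
        d3.append(r_d3)
    return ll, d1, d2, d3


def haar_idwt2(ll, d1, d2, d3):
    """Inverse of haar_dwt2: rebuild the full-resolution frame."""
    out = []
    for r in range(len(ll)):
        top, bottom = [], []
        for k in range(len(ll[0])):
            s, x, y, z = ll[r][k], d1[r][k], d2[r][k], d3[r][k]
            top.extend([(s + x + y + z) / 2, (s - x + y - z) / 2])
            bottom.extend([(s + x - y - z) / 2, (s - x - y + z) / 2])
        out.append(top)
        out.append(bottom)
    return out
```

In this sketch, keeping only `ll` gives a compact, smoothed version of the frame, while the detail bands carry the fine texture that the abstract says is generated by a decoder rather than stored.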
-
AdaptVC: High Quality Voice Conversion with Adaptive Learning
Authors:
Jaehun Kim,
Ji-Hoon Kim,
Yeunju Choi,
Tan Dat Nguyen,
Seongkyu Mun,
Joon Son Chung
Abstract:
The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost the synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
Submitted 14 January, 2025; v1 submitted 2 January, 2025;
originally announced January 2025.
-
Smooth Reference Command Generation and Control for Transition Flight of VTOL Aircraft Using Time-Varying Optimization
Authors:
Jinrae Kim,
John L. Bullock,
Sheng Cheng,
Naira Hovakimyan
Abstract:
Vertical take-off and landing (VTOL) aircraft pose a challenge in generating reference commands during transition flight. While sparsity between hover and cruise flight modes can be promoted for effective transitions by formulating $\ell_{1}$-norm minimization problems, solving these problems offline pointwise in time can lead to non-smooth reference commands, resulting in abrupt transitions. This study addresses this limitation by proposing a time-varying optimization method that explicitly considers time dependence. By leveraging a prediction-correction interior-point time-varying optimization framework, the proposed method solves an ordinary differential equation to update reference commands continuously over time, enabling smooth reference command generation in real time. Numerical simulations with a two-dimensional Lift+Cruise vehicle validate the effectiveness of the proposed method, demonstrating its ability to generate smooth reference commands online.
Submitted 1 January, 2025;
originally announced January 2025.
-
CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
Authors:
Ji-Hoon Kim,
Hong-Sun Yang,
Yoon-Cheol Ju,
Il-Hwan Kim,
Byeong-Yeol Kim,
Joon Son Chung
Abstract:
The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: language-dependent and speaker-dependent generators. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in separate modules, our method can effectively disentangle language and speaker representation. We conduct extensive experiments using various metrics, and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.
Submitted 28 December, 2024;
originally announced December 2024.
-
Advancing Deformable Medical Image Registration with Multi-axis Cross-covariance Attention
Authors:
Mingyuan Meng,
Michael Fulham,
Lei Bi,
Jinman Kim
Abstract:
Deformable image registration is a fundamental requirement for medical image analysis. Recently, transformers have been widely used in deep learning-based registration methods for their ability to capture long-range dependency via self-attention (SA). However, the high computation and memory loads of SA (growing quadratically with the spatial resolution) hinder transformers from processing subtle textural information in high-resolution image features, e.g., at the full and half image resolutions. This limits deformable registration as the high-resolution textural information is crucial for finding precise pixel-wise correspondence between subtle anatomical structures. Cross-covariance Attention (XCA), as a "transposed" version of SA that operates across feature channels, has complexity growing linearly with the spatial resolution, making it feasible to capture long-range dependency among high-resolution image features. However, existing XCA-based transformers merely capture coarse global long-range dependency, which is unsuitable for deformable image registration relying primarily on fine-grained local correspondence. In this study, we propose to improve existing deep learning-based registration methods by embedding a new XCA mechanism. To this end, we design an XCA-based transformer block optimized for deformable medical image registration, named Multi-Axis XCA (MAXCA). Our MAXCA serves as a general network block that can be embedded into various registration network architectures. It can capture both global and local long-range dependency among high-resolution image features by applying regional and dilated XCA in parallel via a multi-axis design. Extensive experiments on two well-benchmarked inter-/intra-patient registration tasks with seven public medical datasets demonstrate that our MAXCA block enables state-of-the-art registration performance.
Submitted 24 December, 2024;
originally announced December 2024.
-
Automatic Left Ventricular Cavity Segmentation via Deep Spatial Sequential Network in 4D Computed Tomography Studies
Authors:
Yuyu Guo,
Lei Bi,
Zhengbin Zhu,
David Dagan Feng,
Ruiyan Zhang,
Qian Wang,
Jinman Kim
Abstract:
Automated segmentation of the left ventricular cavity (LVC) in temporal cardiac image sequences (multiple time points) is a fundamental requirement for quantitative analysis of its structural and functional changes. Deep learning-based methods for the segmentation of LVC are the state of the art; however, these methods are generally formulated to work on single time points, and fail to exploit the complementary information from the temporal image sequences that can aid in segmentation accuracy and consistency among the images across the time points. Furthermore, these segmentation methods perform poorly in segmenting the end-systole (ES) phase images, where the left ventricle deforms to the smallest irregular shape, and the boundary between the blood chamber and myocardium becomes inconspicuous. To overcome these limitations, we propose a new method to automatically segment temporal cardiac images where we introduce a spatial sequential (SS) network to learn the deformation and motion characteristics of the LVC in an unsupervised manner; these characteristics were then integrated with sequential context information derived from bi-directional learning (BL) where both chronological and reverse-chronological directions of the image sequence were used. Our experimental results on a cardiac computed tomography (CT) dataset demonstrated that our spatial-sequential network with bi-directional learning (SS-BL) method outperformed existing methods for LVC segmentation. Our method was also applied to a cardiac MRI dataset, and the results demonstrated the generalizability of our method.
Submitted 17 December, 2024;
originally announced December 2024.
-
Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models
Authors:
Seungeun Oh,
Jinhyuk Kim,
Jihong Park,
Seung-Woo Ko,
Tony Q. S. Quek,
Seong-Lyun Kim
Abstract:
This paper studies a hybrid language model (HLM) architecture that integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network. The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects it, with rejected tokens being resampled by the LLM. While this approach ensures alignment between the vocabulary distributions of the SLM and LLM, it suffers from low token throughput due to uplink transmission and the computation costs of running both language models. To address this, we propose a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM), wherein the SLM locally measures its output uncertainty and skips both uplink transmissions and LLM operations for tokens that are likely to be accepted. This opportunistic skipping is enabled by our empirical finding of a linear correlation between the SLM's uncertainty and the LLM's rejection probability. We analytically derive the uncertainty threshold and evaluate its expected risk of rejection. Simulations show that U-HLM reduces uplink transmissions and LLM computations by 45.93%, while achieving up to 97.54% of the LLM's inference accuracy and 2.54$\times$ faster token throughput than HLM without skipping.
Submitted 18 March, 2025; v1 submitted 17 December, 2024;
originally announced December 2024.
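The opportunistic skipping described in the abstract above can be sketched as a threshold test on the small model's output uncertainty. This is a minimal illustration, not the authors' method: the abstract does not specify how uncertainty is measured, so the measure used here (one minus the top token probability) is an assumption, and the names `uncertainty` and `route_tokens` are hypothetical.

```python
def uncertainty(probs):
    """Hypothetical uncertainty measure: 1 minus the top token probability.

    The abstract does not state which measure U-HLM uses; this one is an
    assumption chosen only to make the sketch concrete.
    """
    return 1.0 - max(probs)


def route_tokens(slm_distributions, threshold):
    """Decide, per token, whether to keep the SLM token locally or
    upload the distribution for LLM verification.

    slm_distributions: list of per-token probability lists from the SLM.
    threshold: uncertainty level at or below which the uplink is skipped.
    """
    decisions = []
    for probs in slm_distributions:
        if uncertainty(probs) <= threshold:
            decisions.append("local")   # confident: skip uplink and LLM
        else:
            decisions.append("uplink")  # uncertain: ask the LLM to verify
    return decisions


def uplink_fraction(decisions):
    """Fraction of tokens that still require an uplink transmission."""
    return decisions.count("uplink") / len(decisions)
```

A lower threshold sends more tokens to the LLM and a higher one skips more; the abstract states that the authors derive their threshold analytically rather than tuning it by hand.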
-
Imagined Speech State Classification for Robust Brain-Computer Interface
Authors:
Byung-Kwan Ko,
Jun-Young Kim,
Seo-Hyun Lee
Abstract:
This study examines the effectiveness of traditional machine learning classifiers versus deep learning models for detecting imagined speech using electroencephalogram data. Specifically, we evaluated conventional machine learning techniques such as CSP-SVM and LDA-SVM classifiers alongside deep learning architectures such as EEGNet, ShallowConvNet, and DeepConvNet. Machine learning classifiers exhibited significantly lower precision and recall, indicating limited feature extraction capabilities and poor generalization between imagined speech and idle states. In contrast, deep learning models, particularly EEGNet, achieved the highest accuracy of 0.7080 and an F1 score of 0.6718, demonstrating their enhanced ability in automatic feature extraction and representation learning, essential for capturing complex neurophysiological patterns. These findings highlight the limitations of conventional machine learning approaches in brain-computer interface (BCI) applications and advocate for adopting deep learning methodologies to achieve more precise and reliable detection of imagined speech. This foundational research contributes to the development of imagined speech-based BCI systems.
Submitted 15 December, 2024;
originally announced December 2024.
-
Improving Automatic Fetal Biometry Measurement with Swoosh Activation Function
Authors:
Shijia Zhou,
Euijoon Ahn,
Hao Wang,
Ann Quinton,
Narelle Kennedy,
Pradeeba Sridar,
Ralph Nanan,
Jinman Kim
Abstract:
Measurements of fetal thalamus diameter (FTD) and fetal head circumference (FHC) are crucial in identifying abnormal fetal thalamus development, which may lead to certain neuropsychiatric disorders in later life. However, manual measurements from 2D-US images are laborious, prone to high inter-observer variability, and complicated by the high signal-to-noise ratio nature of the images. Deep learning-based landmark detection approaches have shown promise in measuring biometrics from US images, but the current state-of-the-art (SOTA) algorithm, BiometryNet, is inadequate for FTD and FHC measurement due to its inability to account for the fuzzy edges of these structures and the complex shape of the FTD structure. To address these inadequacies, we propose a novel Swoosh Activation Function (SAF) designed to enhance the regularization of heatmaps produced by landmark detection algorithms. Our SAF serves as a regularization term to enforce an optimum mean squared error (MSE) level between predicted heatmaps, reducing the dispersiveness of hotspots in predicted heatmaps. Our experimental results demonstrate that SAF significantly improves the measurement performance for FTD and FHC, with higher intraclass correlation coefficient scores in FTD and lower mean difference scores in FHC measurement than those of the current SOTA algorithm BiometryNet. Moreover, our proposed SAF is highly generalizable and architecture-agnostic. The SAF's coefficients can be configured for different tasks, making it highly customizable. Our study demonstrates that the SAF activation function is a novel method that can improve measurement accuracy in fetal biometry landmark detection. This improvement has the potential to contribute to better fetal monitoring and improved neonatal outcomes.
Submitted 15 December, 2024;
originally announced December 2024.
-
Inertia-aware Unit Commitment and Remuneration Methods for Decarbonized Power System
Authors:
HyunJoong Kim,
Jip Kim
Abstract:
To maintain frequency stability in decarbonized power systems, inertia services from synchronous generators (SGs) and inverter-based resources must be procured. However, designing an inertia-aware system operation poses significant challenges in considering the variability and uncertainty of renewable energy sources (RES) and adopting a remuneration method for inertia provision due to SG commitment variables. To address this research gap, we renovate the inertia-aware chance-constrained unit commitment model by incorporating time-coupling constraints for SGs and joint chance constraints for RES uncertainty. We investigate remuneration methods for inertia provision, including uplift, marginal pricing (MP), approximated convex hull pricing (aCHP), and average incremental cost pricing (AIP), applying these to the renovated model. Numerical experiments show that the model enhances frequency stability during a contingency. Among the remuneration methods, only aCHP guarantees revenue adequacy without requiring uplift while maximizing economic welfare. In contrast, MP requires the highest level of uplift to adequately compensate generation costs, as the price function fails to account for inertia provision.
Submitted 14 December, 2024;
originally announced December 2024.
-
Space-time inverse-scattering of translation-based motion
Authors:
Jeongsoo Kim,
Shwetadwip Chowdhury
Abstract:
In optical diffraction tomography (ODT), a sample's 3D refractive-index (RI) is often reconstructed after illuminating it from multiple angles, with the assumption that the sample remains static throughout data collection. When the sample undergoes dynamic motion during this data-collection process, significant artifacts and distortions compromise the fidelity of the reconstructed images. In this study, we develop a space-time inverse-scattering technique for ODT that compensates for the translational motion of multiple-scattering samples during data collection. Our approach involves formulating a joint optimization problem to simultaneously estimate a scattering sample's translational position at each measurement and its motion-corrected 3D RI distribution. Experimental results demonstrate the technique's effectiveness, yielding reconstructions with reduced artifacts, enhanced spatial resolution, and improved quantitative accuracy for samples undergoing continuous translational motion during imaging.
Submitted 12 December, 2024;
originally announced December 2024.
-
evS2CP: Real-time Simultaneous Speed and Charging Planner for Connected Electric Vehicles
Authors:
Minwoo Gwon,
Jiwon Kim,
Seungjun Yoo,
Kwang-Ki K. Kim
Abstract:
This paper presents evS2CP, an optimization-based framework for simultaneous speed and charging planning designed for connected electric vehicles (EVs). With EVs emerging as competitive alternatives to internal combustion engine vehicles, overcoming challenges such as limited charging infrastructure is crucial. evS2CP addresses these issues by minimizing the travel time, charging time, and energy consumption, providing practical solutions for both human-operated and autonomous vehicles. This framework leverages V2X communication to integrate essential EV planning data, including route geometry, real-time traffic conditions, and charging station availability, while simulating dynamic driving environments using open-web API services. The speed and charging planning problem was initially formulated as a nonlinear programming model, which was then convexified into a quadratic programming model without charging-stop constraints. Additionally, a mixed-integer programming approach was employed to optimize charging station selection and minimize the frequency of charging events. A mixed-integer quadratic programming implementation exhibited exceptional computational efficiency and scalability, effectively solving trip plans over distances exceeding 700 km in a few seconds. Simulations conducted using open-source and commercial solvers validated the framework's near-global optimality, demonstrating its robustness and feasibility for real-world applications in connected EV ecosystems.
Submitted 12 December, 2024;
originally announced December 2024.
-
Fundus Image-based Visual Acuity Assessment with PAC-Guarantees
Authors:
Sooyong Jang,
Kuk Jin Jang,
Hyonyoung Choi,
Yong-Seop Han,
Seongjin Lee,
Jin-hyun Kim,
Insup Lee
Abstract:
Timely detection and treatment are essential for maintaining eye health. Visual acuity (VA), which measures the clarity of vision at a distance, is a crucial metric for managing eye health. Machine learning (ML) techniques have been introduced to assist in VA measurement, potentially alleviating clinicians' workloads. However, the inherent uncertainties in ML models make relying solely on them for VA prediction less than ideal. The VA prediction task involves multiple sources of uncertainty, requiring more robust approaches. A promising method is to build prediction sets or intervals rather than point estimates, offering coverage guarantees through techniques like conformal prediction and Probably Approximately Correct (PAC) prediction sets. Despite their potential, these approaches have not yet been applied to the VA prediction task. To address this, we propose a method for deriving prediction intervals for estimating visual acuity from fundus images with a PAC guarantee. Our experimental results demonstrate that the PAC guarantees are upheld, with performance comparable to or better than that of two prior works that do not provide such guarantees.
Submitted 9 December, 2024;
originally announced December 2024.
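To make the idea of a prediction interval with a coverage guarantee concrete, the following sketch uses split conformal prediction, one of the techniques the abstract above names. It is not the PAC prediction-set construction proposed in the paper, whose details the abstract does not give. The function names and the use of absolute residuals as the conformity score are assumptions made for illustration.

```python
import math


def conformal_halfwidth(cal_preds, cal_targets, alpha):
    """Split conformal half-width from a held-out calibration set.

    cal_preds, cal_targets: model predictions and true values on
    calibration examples not used for training.
    alpha: target miscoverage rate (e.g. 0.1 for 90% coverage).
    Returns the residual quantile used to widen point predictions.
    """
    scores = sorted(abs(p - y) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: interval is unbounded.
        return float("inf")
    return scores[k - 1]


def predict_interval(pred, halfwidth):
    """Symmetric interval around a point prediction."""
    return (pred - halfwidth, pred + halfwidth)
```

Under the usual exchangeability assumption, intervals built this way cover the true value with probability at least 1 - alpha on average over calibration draws. A PAC guarantee, as targeted by the paper, is a different and stronger kind of statement that additionally holds with high confidence over the calibration set.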
-
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
Authors:
Jeongsoo Choi,
Ji-Hoon Kim,
Jinyu Li,
Joon Son Chung,
Shujie Liu
Abstract:
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances.
Submitted 29 November, 2024;
originally announced November 2024.
-
DMVC-Tracker: Distributed Multi-Agent Trajectory Planning for Target Tracking Using Dynamic Buffered Voronoi and Inter-Visibility Cells
Authors:
Yunwoo Lee,
Jungwon Park,
H. Jin Kim
Abstract:
This letter presents a distributed trajectory planning method for multi-agent aerial tracking. The proposed method uses a Dynamic Buffered Voronoi Cell (DBVC) and a Dynamic Inter-Visibility Cell (DIVC) to formulate the distributed trajectory generation. Specifically, the DBVC and the DIVC are time-variant spaces that prevent mutual collisions and occlusions among agents, while enabling them to maintain suitable distances from the moving target. We combine the DBVC and the DIVC with an efficient Bernstein polynomial motion primitive-based tracking generation method, which has been refined into a less conservative approach than in our previous work. The proposed algorithm can compute each agent's trajectory within several milliseconds on an Intel i7 desktop. We validate the tracking performance in challenging scenarios, including environments with dozens of obstacles.
Submitted 5 March, 2025; v1 submitted 27 November, 2024;
originally announced November 2024.
-
New Test-Time Scenario for Biosignal: Concept and Its Approach
Authors:
Yong-Yeon Jo,
Byeong Tak Lee,
Beom Joon Kim,
Jeong-Ho Hong,
Hak Seung Lee,
Joon-myoung Kwon
Abstract:
Online Test-Time Adaptation (OTTA) enhances model robustness by updating pre-trained models with unlabeled data during testing. In healthcare, OTTA is vital for real-time tasks like predicting blood pressure from biosignals, which demand continuous adaptation. We introduce a new test-time scenario with streams of unlabeled samples and occasional labeled samples. Our framework combines supervised and self-supervised learning, employing a dual-queue buffer and weighted batch sampling to balance data types. Experiments show improved accuracy and adaptability under real-world conditions.
Submitted 26 November, 2024;
originally announced November 2024.
-
Minimizing Conservatism in Safety-Critical Control for Input-Delayed Systems via Adaptive Delay Estimation
Authors:
Yitaek Kim,
Ersin Das,
Jeeseop Kim,
Aaron D. Ames,
Joel W. Burdick,
Christoffer Sloth
Abstract:
Input delays affect systems such as teleoperation and wirelessly connected autonomous vehicles, and may lead to safety violations. One promising way to ensure safety in the presence of delay is to employ control barrier functions (CBFs), and extensions thereof that account for uncertainty: delay adaptive CBFs (DaCBFs). This paper proposes an online adaptive safety control framework for reducing the conservatism of DaCBFs. The main idea is to reduce the maximum delay estimation error bound so that the state prediction error bound is monotonically non-increasing. To this end, we first leverage the estimation error bound of a disturbance observer to bound the state prediction error. Second, we design two nonlinear programs to update the maximum delay estimation error bound satisfying the prediction error bound, and subsequently update the maximum state prediction error bound used in DaCBFs. The proposed method ensures the maximum state prediction error bound is monotonically non-increasing, yielding less conservatism in DaCBFs. We verify the proposed method in an automated connected truck application, showing that it reduces the conservatism of DaCBFs.
Submitted 26 November, 2024;
originally announced November 2024.