-
IMPACT: Industrial Machine Perception via Acoustic Cognitive Transformer
Authors:
Changheon Han,
Yuseop Sim,
Hoin Jung,
Jiho Lee,
Hojun Lee,
Yun Seok Kang,
Sucheol Woo,
Garam Kim,
Hyung Wook Park,
Martin Byung-Guk Jun
Abstract:
Acoustic signals from industrial machines offer valuable insights for anomaly detection, predictive maintenance, and operational efficiency enhancement. However, existing task-specific, supervised learning methods often scale poorly and fail to generalize across diverse industrial scenarios, whose acoustic characteristics are distinct from general audio. Furthermore, the scarcity of accessible, la…
▽ More
Acoustic signals from industrial machines offer valuable insights for anomaly detection, predictive maintenance, and operational efficiency enhancement. However, existing task-specific, supervised learning methods often scale poorly and fail to generalize across diverse industrial scenarios, whose acoustic characteristics are distinct from general audio. Furthermore, the scarcity of accessible, large-scale datasets and pretrained models tailored for industrial audio impedes community-driven research and benchmarking. To address these challenges, we introduce DINOS (Diverse INdustrial Operation Sounds), a large-scale open-access dataset. DINOS comprises over 74,149 audio samples (exceeding 1,093 hours) collected from various industrial acoustic scenarios. We also present IMPACT (Industrial Machine Perception via Acoustic Cognitive Transformer), a novel foundation model for industrial machine sound analysis. IMPACT is pretrained on DINOS in a self-supervised manner. By jointly optimizing utterance and frame-level losses, it captures both global semantics and fine-grained temporal structures. This makes its representations suitable for efficient fine-tuning on various industrial downstream tasks with minimal labeled data. Comprehensive benchmarking across 30 distinct downstream tasks (spanning four machine types) demonstrates that IMPACT outperforms existing models on 24 tasks, establishing its superior effectiveness and robustness, while providing a new performance benchmark for future research.
△ Less
Submitted 8 July, 2025;
originally announced July 2025.
-
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
Authors:
Hahyeon Choi,
Junhoo Lee,
Nojun Kwak
Abstract:
Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmar…
▽ More
Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.
△ Less
Submitted 8 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Learning Humanoid Arm Motion via Centroidal Momentum Regularized Multi-Agent Reinforcement Learning
Authors:
Ho Jae Lee,
Se Hwan Jeon,
Sangbae Kim
Abstract:
Humans naturally swing their arms during locomotion to regulate whole-body dynamics, reduce angular momentum, and help maintain balance. Inspired by this principle, we present a limb-level multi-agent reinforcement learning (RL) framework that enables coordinated whole-body control of humanoid robots through emergent arm motion. Our approach employs separate actor-critic structures for the arms an…
▽ More
Humans naturally swing their arms during locomotion to regulate whole-body dynamics, reduce angular momentum, and help maintain balance. Inspired by this principle, we present a limb-level multi-agent reinforcement learning (RL) framework that enables coordinated whole-body control of humanoid robots through emergent arm motion. Our approach employs separate actor-critic structures for the arms and legs, trained with centralized critics but decentralized actors that share only base states and centroidal angular momentum (CAM) observations, allowing each agent to specialize in task-relevant behaviors through modular reward design. The arm agent guided by CAM tracking and damping rewards promotes arm motions that reduce overall angular momentum and vertical ground reaction moments, contributing to improved balance during locomotion or under external perturbations. Comparative studies with single-agent and alternative multi-agent baselines further validate the effectiveness of our approach. Finally, we deploy the learned policy on a humanoid platform, achieving robust performance across diverse locomotion tasks, including flat-ground walking, rough terrain traversal, and stair climbing.
△ Less
Submitted 5 July, 2025;
originally announced July 2025.
-
EdgeSRIE: A hybrid deep learning framework for real-time speckle reduction and image enhancement on portable ultrasound systems
Authors:
Hyunwoo Cho,
Jongsoo Lee,
Jinbum Kang,
Yangmo Yoo
Abstract:
Speckle patterns in ultrasound images often obscure anatomical details, leading to diagnostic uncertainty. Recently, various deep learning (DL)-based techniques have been introduced to effectively suppress speckle; however, their high computational costs pose challenges for low-resource devices, such as portable ultrasound systems. To address this issue, EdgeSRIE, which is a lightweight hybrid DL…
▽ More
Speckle patterns in ultrasound images often obscure anatomical details, leading to diagnostic uncertainty. Recently, various deep learning (DL)-based techniques have been introduced to effectively suppress speckle; however, their high computational costs pose challenges for low-resource devices, such as portable ultrasound systems. To address this issue, EdgeSRIE, which is a lightweight hybrid DL framework for real-time speckle reduction and image enhancement in portable ultrasound imaging, is introduced. The proposed framework consists of two main branches: an unsupervised despeckling branch, which is trained by minimizing a loss function between speckled images, and a deblurring branch, which restores blurred images to sharp images. For hardware implementation, the trained network is quantized to 8-bit integer precision and deployed on a low-resource system-on-chip (SoC) with limited power consumption. In the performance evaluation with phantom and in vivo analyses, EdgeSRIE achieved the highest contrast-to-noise ratio (CNR) and average gradient magnitude (AGM) compared with the other baselines (different 2-rule-based methods and other 4-DL-based methods). Furthermore, EdgeSRIE enabled real-time inference at over 60 frames per second while satisfying computational requirements (< 20K parameters) on actual portable ultrasound hardware. These results demonstrated the feasibility of EdgeSRIE for real-time, high-quality ultrasound imaging in resource-limited environments.
△ Less
Submitted 5 July, 2025;
originally announced July 2025.
-
On the Relationship between Accent Strength and Articulatory Features
Authors:
Kevin Huang,
Sean Foley,
Jihwan Lee,
Yoonjeong Lee,
Dani Byrd,
Shrikanth Narayanan
Abstract:
This paper explores the relationship between accent strength and articulatory features inferred from acoustic speech. To quantify accent strength, we compare phonetic transcriptions with transcriptions based on dictionary-based references, computing phoneme-level difference as a measure of accent strength. The proposed framework leverages recent self-supervised learning articulatory inversion tech…
▽ More
This paper explores the relationship between accent strength and articulatory features inferred from acoustic speech. To quantify accent strength, we compare phonetic transcriptions with transcriptions based on dictionary-based references, computing phoneme-level difference as a measure of accent strength. The proposed framework leverages recent self-supervised learning articulatory inversion techniques to estimate articulatory features. Analyzing a corpus of read speech from American and British English speakers, this study examines correlations between derived articulatory parameters and accent strength proxies, associating systematic articulatory differences with indexed accent strength. Results indicate that tongue positioning patterns distinguish the two dialects, with notable differences inter-dialects in rhotic and low back vowels. These findings contribute to automated accent analysis and articulatory modeling for speech processing applications.
△ Less
Submitted 3 July, 2025;
originally announced July 2025.
-
An Adaptive Estimation Approach based on Fisher Information to Overcome the Challenges of LFP Battery SOC Estimation
Authors:
Junzhe Shi,
Shida Jiang,
Shengyu Tao,
Jaewong Lee,
Manashita Borah,
Scott Moura
Abstract:
Robust and Real-time State of Charge (SOC) estimation is essential for Lithium Iron Phosphate (LFP) batteries, which are widely used in electric vehicles (EVs) and energy storage systems due to safety and longevity. However, the flat Open Circuit Voltage (OCV)-SOC curve makes this task particularly challenging. This challenge is complicated by hysteresis effects, and real-world conditions such as…
▽ More
Robust and Real-time State of Charge (SOC) estimation is essential for Lithium Iron Phosphate (LFP) batteries, which are widely used in electric vehicles (EVs) and energy storage systems due to safety and longevity. However, the flat Open Circuit Voltage (OCV)-SOC curve makes this task particularly challenging. This challenge is complicated by hysteresis effects, and real-world conditions such as current bias, voltage quantization errors, and temperature that must be considered in the battery management system use. In this paper, we proposed an adaptive estimation approach to overcome the challenges of LFPSOC estimation. Specifically, the method uses an adaptive fisher information fusion strategy that adaptively combines the SOC estimation from two different models, which are Coulomb counting and equivalent circuit model-based parameter identification. The effectiveness of this strategy is rationalized by the information richness excited by external cycling signals. A 3D OCV-H-SOC map that captures the relationship between OCV, hysteresis, and SOC was proposed as the backbone, and can be generalizable to other widely adopted parameter-identification methods. Extensive validation under ideal and real-world use scenarios, including SOC-OCV flat zones, current bias, voltage quantization errors, low temperatures, and insufficient current excitations, have been performed using 4 driving profiles, i.e., the Orange County Transit Bus Cycle, the California Unified Cycle, the US06 Drive Cycle, and the New York City Cycle, where the results demonstrate superiority over the state-of-the-art unscented Kalman filter, long short-term memory networks and transformer in all validation cases.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
SegmentAnyMuscle: A universal muscle segmentation model across different locations in MRI
Authors:
Roy Colglazier,
Jisoo Lee,
Haoyu Dong,
Hanxue Gu,
Yaqian Chen,
Joseph Cao,
Zafer Yildiz,
Zhonghao Liu,
Nicholas Konz,
Jichen Yang,
Jikai Zhang,
Yuwen Chen,
Lin Li,
Adrian Camarena,
Maciej A. Mazurowski
Abstract:
The quantity and quality of muscles are increasingly recognized as important predictors of health outcomes. While MRI offers a valuable modality for such assessments, obtaining precise quantitative measurements of musculature remains challenging. This study aimed to develop a publicly available model for muscle segmentation in MRIs and demonstrate its applicability across various anatomical locati…
▽ More
The quantity and quality of muscles are increasingly recognized as important predictors of health outcomes. While MRI offers a valuable modality for such assessments, obtaining precise quantitative measurements of musculature remains challenging. This study aimed to develop a publicly available model for muscle segmentation in MRIs and demonstrate its applicability across various anatomical locations and imaging sequences. A total of 362 MRIs from 160 patients at a single tertiary center (Duke University Health System, 2016-2020) were included, with 316 MRIs from 114 patients used for model development. The model was tested on two separate sets: one with 28 MRIs representing common sequence types, achieving an average Dice Similarity Coefficient (DSC) of 88.45%, and another with 18 MRIs featuring less frequent sequences and abnormalities such as muscular atrophy, hardware, and significant noise, achieving 86.21% DSC. These results demonstrate the feasibility of a fully automated deep learning algorithm for segmenting muscles on MRI across diverse settings. The public release of this model enables consistent, reproducible research into the relationship between musculature and health.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4
Authors:
Jongyeon Park,
Joonhee Lee,
Do-Hyeon Lim,
Hong Kook Kim,
Hyeongcheol Geum,
Jeong Eun Lim
Abstract:
This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to im-prove the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is…
▽ More
This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to im-prove the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alterna-tive perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classi-fication accuracy of low-performing classes by removing irrele-vant samples and incorporating external data. That is, audio mix-tures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submit-ted systems employing these approaches relatively improve CA-SDRi by up to 14.7% compared to the baseline of DCASE 2025 Challenge Task 4.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
Fine-Tuning and Prompt Engineering of LLMs, for the Creation of Multi-Agent AI for Addressing Sustainable Protein Production Challenges
Authors:
Alexander D. Kalian,
Jaewook Lee,
Stefan P. Johannesson,
Lennart Otte,
Christer Hogstrand,
Miao Guo
Abstract:
The global demand for sustainable protein sources has accelerated the need for intelligent tools that can rapidly process and synthesise domain-specific scientific knowledge. In this study, we present a proof-of-concept multi-agent Artificial Intelligence (AI) framework designed to support sustainable protein production research, with an initial focus on microbial protein sources. Our Retrieval-Au…
▽ More
The global demand for sustainable protein sources has accelerated the need for intelligent tools that can rapidly process and synthesise domain-specific scientific knowledge. In this study, we present a proof-of-concept multi-agent Artificial Intelligence (AI) framework designed to support sustainable protein production research, with an initial focus on microbial protein sources. Our Retrieval-Augmented Generation (RAG)-oriented system consists of two GPT-based LLM agents: (1) a literature search agent that retrieves relevant scientific literature on microbial protein production for a specified microbial strain, and (2) an information extraction agent that processes the retrieved content to extract relevant biological and chemical information. Two parallel methodologies, fine-tuning and prompt engineering, were explored for agent optimisation. Both methods demonstrated effectiveness at improving the performance of the information extraction agent in terms of transformer-based cosine similarity scores between obtained and ideal outputs. Mean cosine similarity scores were increased by up to 25%, while universally reaching mean scores of $\geq 0.89$ against ideal output text. Fine-tuning overall improved the mean scores to a greater extent (consistently of $\geq 0.94$) compared to prompt engineering, although lower statistical uncertainties were observed with the latter approach. A user interface was developed and published for enabling the use of the multi-agent AI system, alongside preliminary exploration of additional chemical safety-based search capabilities
△ Less
Submitted 25 June, 2025;
originally announced June 2025.
-
Vo-Ve: An Explainable Voice-Vector for Speaker Identity Evaluation
Authors:
Jaejun Lee,
Kyogu Lee
Abstract:
In this paper, we propose Vo-Ve, a novel voice-vector embedding that captures speaker identity. Unlike conventional speaker embeddings, Vo-Ve is explainable, as it contains the probabilities of explicit voice attribute classes. Through extensive analysis, we demonstrate that Vo-Ve not only evaluates speaker similarity competitively with conventional techniques but also provides an interpretable ex…
▽ More
In this paper, we propose Vo-Ve, a novel voice-vector embedding that captures speaker identity. Unlike conventional speaker embeddings, Vo-Ve is explainable, as it contains the probabilities of explicit voice attribute classes. Through extensive analysis, we demonstrate that Vo-Ve not only evaluates speaker similarity competitively with conventional techniques but also provides an interpretable explanation in terms of voice attributes. We strongly believe that Vo-Ve can enhance evaluation schemes across various speech tasks due to its high-level explainability.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
DiffO: Single-step Diffusion for Image Compression at Ultra-Low Bitrates
Authors:
Chanung Park,
Joo Chan Lee,
Jong Hwan Ko
Abstract:
Although image compression is fundamental to visual data processing and has inspired numerous standard and learned codecs, these methods still suffer severe quality degradation at extremely low bits per pixel. While recent diffusion based models provided enhanced generative performance at low bitrates, they still yields limited perceptual quality and prohibitive decoding latency due to multiple de…
▽ More
Although image compression is fundamental to visual data processing and has inspired numerous standard and learned codecs, these methods still suffer severe quality degradation at extremely low bits per pixel. While recent diffusion based models provided enhanced generative performance at low bitrates, they still yields limited perceptual quality and prohibitive decoding latency due to multiple denoising steps. In this paper, we propose the first single step diffusion model for image compression (DiffO) that delivers high perceptual quality and fast decoding at ultra low bitrates. DiffO achieves these goals by coupling two key innovations: (i) VQ Residual training, which factorizes a structural base code and a learned residual in latent space, capturing both global geometry and high frequency details; and (ii) rate adaptive noise modulation, which tunes denoising strength on the fly to match the desired bitrate. Extensive experiments show that DiffO surpasses state of the art compression performance while improving decoding speed by about 50x compared to prior diffusion-based methods, greatly improving the practicality of generative codecs. The code will be available at https://github.com/Freemasti/DiffO.
△ Less
Submitted 19 June, 2025;
originally announced June 2025.
-
ASAP-FE: Energy-Efficient Feature Extraction Enabling Multi-Channel Keyword Spotting on Edge Processors
Authors:
Jongin Choi,
Jina Park,
Woojoo Lee,
Jae-Jin Lee,
Massoud Pedram
Abstract:
Multi-channel keyword spotting (KWS) has become crucial for voice-based applications in edge environments. However, its substantial computational and energy requirements pose significant challenges. We introduce ASAP-FE (Agile Sparsity-Aware Parallelized-Feature Extractor), a hardware-oriented front-end designed to address these challenges. Our framework incorporates three key innovations: (1) Hal…
▽ More
Multi-channel keyword spotting (KWS) has become crucial for voice-based applications in edge environments. However, its substantial computational and energy requirements pose significant challenges. We introduce ASAP-FE (Agile Sparsity-Aware Parallelized-Feature Extractor), a hardware-oriented front-end designed to address these challenges. Our framework incorporates three key innovations: (1) Half-overlapped Infinite Impulse Response (IIR) Framing: This reduces redundant data by approximately 25% while maintaining essential phoneme transition cues. (2) Sparsity-aware Data Reduction: We exploit frame-level sparsity to achieve an additional 50% data reduction by combining frame skipping with stride-based filtering. (3) Dynamic Parallel Processing: We introduce a parameterizable filter cluster and a priority-based scheduling algorithm that allows parallel execution of IIR filtering tasks, reducing latency and optimizing energy efficiency. ASAP-FE is implemented with various filter cluster sizes on edge processors, with functionality verified on FPGA prototypes and designs synthesized at 45 nm. Experimental results using TC-ResNet8, DS-CNN, and KWT-1 demonstrate that ASAP-FE reduces the average workload by 62.73% while supporting real-time processing for up to 32 channels. Compared to a conventional fully overlapped baseline, ASAP-FE achieves less than a 1% accuracy drop (e.g., 96.22% vs. 97.13% for DS-CNN), which is well within acceptable limits for edge AI. By adjusting the number of filter modules, our design optimizes the trade-off between performance and energy, with 15 parallel filters providing optimal performance for up to 25 channels. Overall, ASAP-FE offers a practical and efficient solution for multi-channel KWS on energy-constrained edge devices.
△ Less
Submitted 17 June, 2025;
originally announced June 2025.
-
Ground Reaction Force Estimation via Time-aware Knowledge Distillation
Authors:
Eun Som Jeon,
Sinjini Mitra,
Jisoo Lee,
Omik M. Save,
Ankita Shukla,
Hyunglae Lee,
Pavan Turaga
Abstract:
Human gait analysis with wearable sensors has been widely used in various applications, such as daily life healthcare, rehabilitation, physical therapy, and clinical diagnostics and monitoring. In particular, ground reaction force (GRF) provides critical information about how the body interacts with the ground during locomotion. Although instrumented treadmills have been widely used as the gold st…
▽ More
Human gait analysis with wearable sensors has been widely used in various applications, such as daily life healthcare, rehabilitation, physical therapy, and clinical diagnostics and monitoring. In particular, ground reaction force (GRF) provides critical information about how the body interacts with the ground during locomotion. Although instrumented treadmills have been widely used as the gold standard for measuring GRF during walking, their lack of portability and high cost make them impractical for many applications. As an alternative, low-cost, portable, wearable insole sensors have been utilized to measure GRF; however, these sensors are susceptible to noise and disturbance and are less accurate than treadmill measurements. To address these challenges, we propose a Time-aware Knowledge Distillation framework for GRF estimation from insole sensor data. This framework leverages similarity and temporal features within a mini-batch during the knowledge distillation process, effectively capturing the complementary relationships between features and the sequential properties of the target and input data. The performance of the lightweight models distilled through this framework was evaluated by comparing GRF estimations from insole sensor data against measurements from an instrumented treadmill. Empirical results demonstrated that Time-aware Knowledge Distillation outperforms current baselines in GRF estimation from wearable sensor data.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
Multi-Platform Methane Plume Detection via Model and Domain Adaptation
Authors:
Vassiliki Mancoridis,
Brian Bue,
Jake H. Lee,
Andrew K. Thorpe,
Daniel Cusworth,
Alana Ayasse,
Philip G. Brodrick,
Riley Duren
Abstract:
Prioritizing methane for near-term climate action is crucial due to its significant impact on global warming. Previous work used columnwise matched filter products from the airborne AVIRIS-NG imaging spectrometer to detect methane plume sources; convolutional neural networks (CNNs) discerned anthropogenic methane plumes from false positive enhancements. However, as an increasing number of remote s…
▽ More
Prioritizing methane for near-term climate action is crucial due to its significant impact on global warming. Previous work used columnwise matched filter products from the airborne AVIRIS-NG imaging spectrometer to detect methane plume sources; convolutional neural networks (CNNs) discerned anthropogenic methane plumes from false positive enhancements. However, as an increasing number of remote sensing platforms are used for methane plume detection, there is a growing need to address cross-platform alignment. In this work, we describe model- and data-driven machine learning approaches that leverage airborne observations to improve spaceborne methane plume detection, reconciling the distributional shifts inherent with performing the same task across platforms. We develop a spaceborne methane plume classifier using data from the EMIT imaging spectroscopy mission. We refine classifiers trained on airborne imagery from AVIRIS-NG campaigns using transfer learning, outperforming the standalone spaceborne model. Finally, we use CycleGAN, an unsupervised image-to-image translation technique, to align the data distributions between airborne and spaceborne contexts. Translating spaceborne EMIT data to the airborne AVIRIS-NG domain using CycleGAN and applying airborne classifiers directly yields the best plume detection results. This methodology is useful not only for data simulation, but also for direct data alignment. Though demonstrated on the task of methane plume detection, our work more broadly demonstrates a data-driven approach to align related products obtained from distinct remote sensing instruments.
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
Authors:
Helin Wang,
Jiarui Hai,
Dading Chong,
Karan Thakkar,
Tiantian Feng,
Dongchao Yang,
Junhyeok Lee,
Laureano Moro Velazquez,
Jesus Villalba,
Zengyi Qin,
Shrikanth Narayanan,
Mounya Elhiali,
Najim Dehak
Abstract:
Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchm…
▽ More
Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agent (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement
Authors:
Seungu Han,
Sungho Lee,
Juheon Lee,
Kyogu Lee
Abstract:
Deep generative models have recently been employed for speech enhancement to generate perceptually valid clean speech on large-scale datasets. Several diffusion models have been proposed, and more recently, a tractable Schrödinger Bridge has been introduced to transport between the clean and noisy speech distributions. However, these models often suffer from an iterative reverse process and requir…
▽ More
Deep generative models have recently been employed for speech enhancement to generate perceptually valid clean speech on large-scale datasets. Several diffusion models have been proposed, and more recently, a tractable Schrödinger Bridge has been introduced to transport between the clean and noisy speech distributions. However, these models often suffer from an iterative reverse process and require a large number of sampling steps -- more than 50. Our investigation reveals that the performance of baseline models significantly degrades when the number of sampling steps is reduced, particularly under low-SNR conditions. We propose integrating Schrödinger Bridge with GANs to effectively mitigate this issue, achieving high-quality outputs on full-band datasets while substantially reducing the required sampling steps. Experimental results demonstrate that our proposed model outperforms existing baselines, even with a single inference step, in both denoising and dereverberation tasks.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
CF-DETR: Coarse-to-Fine Transformer for Real-Time Object Detection
Authors:
Woojin Shin,
Donghwa Kang,
Byeongyun Park,
Brent Byunghoon Kang,
Jinkyu Lee,
Hyeongboo Baek
Abstract:
Detection Transformers (DETR) are increasingly adopted in autonomous vehicle (AV) perception systems due to their superior accuracy over convolutional networks. However, concurrently executing multiple DETR tasks presents significant challenges in meeting firm real-time deadlines (R1) and high accuracy requirements (R2), particularly for safety-critical objects, while navigating the inherent laten…
▽ More
Detection Transformers (DETR) are increasingly adopted in autonomous vehicle (AV) perception systems due to their superior accuracy over convolutional networks. However, concurrently executing multiple DETR tasks presents significant challenges in meeting firm real-time deadlines (R1) and high accuracy requirements (R2), particularly for safety-critical objects, while navigating the inherent latency-accuracy trade-off under resource constraints. Existing real-time DNN scheduling approaches often treat models generically, failing to leverage Transformer-specific properties for efficient resource allocation. To address these challenges, we propose CF-DETR, an integrated system featuring a novel coarse-to-fine Transformer architecture and a dedicated real-time scheduling framework NPFP**. CF-DETR employs three key strategies (A1: coarse-to-fine inference, A2: selective fine inference, A3: multi-level batch inference) that exploit Transformer properties to dynamically adjust patch granularity and attention scope based on object criticality, aiming to satisfy R2. The NPFP** scheduling framework (A4) orchestrates these adaptive mechanisms A1-A3. It partitions each DETR task into a safety-critical coarse subtask for guaranteed critical object detection within its deadline (ensuring R1), and an optional fine subtask for enhanced overall accuracy (R2), while managing individual and batched execution. Our extensive evaluations on server, GPU-enabled embedded platforms, and actual AV platforms demonstrate that CF-DETR, under an NPFP** policy, successfully meets strict timing guarantees for critical operations and achieves significantly higher overall and critical object detection accuracy compared to existing baselines across diverse AV workloads.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
A Novel Deep Learning Framework for Efficient Multichannel Acoustic Feedback Control
Authors:
Yuan-Kuei Wu,
Juan Azcarreta,
Kashyap Patel,
Buye Xu,
Jung-Suk Lee,
Sanha Lee,
Ashutosh Pandey
Abstract:
This study presents a deep-learning framework for controlling multichannel acoustic feedback in audio devices. Traditional digital signal processing methods struggle with convergence when dealing with highly correlated noise such as feedback. We introduce a Convolutional Recurrent Network that efficiently combines spatial and temporal processing, significantly enhancing speech enhancement capabili…
▽ More
This study presents a deep-learning framework for controlling multichannel acoustic feedback in audio devices. Traditional digital signal processing methods struggle with convergence when dealing with highly correlated noise such as feedback. We introduce a Convolutional Recurrent Network that efficiently combines spatial and temporal processing, significantly enhancing speech enhancement capabilities with lower computational demands. Our approach utilizes three training methods: In-a-Loop Training, Teacher Forcing, and a Hybrid strategy with a Multichannel Wiener Filter, optimizing performance in complex acoustic environments. This scalable framework offers a robust solution for real-world applications, making significant advances in Acoustic Feedback Control technology.
△ Less
Submitted 29 May, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits
Authors:
Tiantian Feng,
Jihwan Lee,
Anfeng Xu,
Yoonjeong Lee,
Thanathai Lertpetchpun,
Xuan Shi,
Helin Wang,
Thomas Thebaud,
Laureano Moro-Velazquez,
Dani Byrd,
Najim Dehak,
Shrikanth Narayanan
Abstract:
We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benc…
▽ More
We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile is publicly available at: https://github.com/tiantiaf0627/vox-profile-release.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
Articulatory Feature Prediction from Surface EMG during Speech Production
Authors:
Jihwan Lee,
Kevin Huang,
Kleanthis Avramidis,
Simon Pistrosch,
Monica Gonzalez-Machorro,
Yoonjeong Lee,
Björn Schuller,
Louis Goldstein,
Shrikanth Narayanan
Abstract:
We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that t…
▽ More
We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.
△ Less
Submitted 28 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations
Authors:
Seungmin Kim,
Sohee Park,
Donghyun Kim,
Jisu Lee,
Daeseon Choi
Abstract:
With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhance…
▽ More
With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals, reconstructing them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat.
In extensive experiments, RoVo increased the Defense Success Rate (DSR) by over 70% compared to unprotected speech, across four state-of-the-art speech synthesis models. Specifically, RoVo achieved a DSR of 99.5% on a commercial speaker-verification API, effectively neutralizing speech synthesis attack. Moreover, RoVo's perturbations remained robust even under strong speech enhancement conditions, outperforming traditional methods. A user study confirmed that RoVo preserves both naturalness and usability of protected speech, highlighting its effectiveness in complex and evolving threat scenarios.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
On-Device LLM for Context-Aware Wi-Fi Roaming
Authors:
Ju-Hyung Lee,
Yanqing Lu,
Klaus Doppler
Abstract:
Roaming in Wireless LAN (Wi-Fi) is a critical yet challenging task for maintaining seamless connectivity in dynamic mobile environments. Conventional threshold-based or heuristic schemes often fail, leading to either sticky or excessive handovers. We introduce the first cross-layer use of an on-device large language model (LLM): high-level reasoning in the application layer that issues real-time a…
▽ More
Roaming in Wireless LAN (Wi-Fi) is a critical yet challenging task for maintaining seamless connectivity in dynamic mobile environments. Conventional threshold-based or heuristic schemes often fail, leading to either sticky or excessive handovers. We introduce the first cross-layer use of an on-device large language model (LLM): high-level reasoning in the application layer that issues real-time actions executed in the PHY/MAC stack. The LLM addresses two tasks: (i) context-aware AP selection, where structured prompts fuse environmental cues (e.g., location, time) to choose the best BSSID; and (ii) dynamic threshold adjustment, where the model adaptively decides when to roam. To satisfy the tight latency and resource budgets of edge hardware, we apply a suite of optimizations-chain-of-thought prompting, parameter-efficient fine-tuning, and quantization. Experiments on indoor and outdoor datasets show that our approach surpasses legacy heuristics and DRL baselines, achieving a strong balance between roaming stability and signal quality. These findings underscore the promise of application-layer LLM reasoning for lower-layer wireless control in future edge systems.
△ Less
Submitted 20 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
FLOWER: Flow-Based Estimated Gaussian Guidance for General Speech Restoration
Authors:
Da-Hee Yang,
Jaeuk Lee,
Joon-Hyuk Chang
Abstract:
We introduce FLOWER, a novel conditioning method designed for speech restoration that integrates Gaussian guidance into generative frameworks. By transforming clean speech into a predefined prior distribution (e.g., Gaussian distribution) using a normalizing flow network, FLOWER extracts critical information to guide generative models. This guidance is incorporated into each block of the generativ…
▽ More
We introduce FLOWER, a novel conditioning method designed for speech restoration that integrates Gaussian guidance into generative frameworks. By transforming clean speech into a predefined prior distribution (e.g., Gaussian distribution) using a normalizing flow network, FLOWER extracts critical information to guide generative models. This guidance is incorporated into each block of the generative network, enabling precise restoration control. Experimental results demonstrate the effectiveness of FLOWER in improving performance across various general speech restoration tasks.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Improving Editability in Image Generation with Layer-wise Memory
Authors:
Daneul Kim,
Jaeah Lee,
Jaesik Park
Abstract:
Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing: especially with maintaining previous edits along with adapting new objects naturally into the existing content. These limitations significantly hinder complex editing scenarios where multi…
▽ More
Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing: especially with maintaining previous edits along with adapting new objects naturally into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance that leverages memorized latents to maintain scene coherence and Multi-Query Disentanglement in cross-attention that ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Stabilization by Controllers Having Integer Coefficients
Authors:
Joowon Lee,
Donggil Lee,
Junsoo Kim
Abstract:
The system property of ``having integer coefficients,'' that is, a transfer function has an integer monic polynomial as its denominator, is significant in the field of encrypted control as it is required for a dynamic controller to be realized over encrypted data. This paper shows that there always exists a controller with integer coefficients stabilizing a given discrete-time linear time-invarian…
▽ More
The system property of ``having integer coefficients,'' that is, a transfer function has an integer monic polynomial as its denominator, is significant in the field of encrypted control as it is required for a dynamic controller to be realized over encrypted data. This paper shows that there always exists a controller with integer coefficients stabilizing a given discrete-time linear time-invariant plant. A constructive algorithm to obtain such a controller is provided, along with numerical examples. Furthermore, the proposed method is applied to converting a pre-designed controller to have integer coefficients, while the original performance is preserved in the sense that the transfer function of the closed-loop system remains unchanged.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
Efficient and robust 3D blind harmonization for large domain gaps
Authors:
Hwihun Jeong,
Hayeon Lee,
Se Young Chun,
Jongho Lee
Abstract:
Blind harmonization has emerged as a promising technique for MR image harmonization to achieve scale-invariant representations, requiring only target domain data (i.e., no source domain data necessary). However, existing methods face limitations such as inter-slice heterogeneity in 3D, moderate image quality, and limited performance for a large domain gap. To address these challenges, we introduce…
▽ More
Blind harmonization has emerged as a promising technique for MR image harmonization to achieve scale-invariant representations, requiring only target domain data (i.e., no source domain data necessary). However, existing methods face limitations such as inter-slice heterogeneity in 3D, moderate image quality, and limited performance for a large domain gap. To address these challenges, we introduce BlindHarmonyDiff, a novel blind 3D harmonization framework that leverages an edge-to-image model tailored specifically to harmonization. Our framework employs a 3D rectified flow trained on target domain images to reconstruct the original image from an edge map, then yielding a harmonized image from the edge of a source domain image. We propose multi-stride patch training for efficient 3D training and a refinement module for robust inference by suppressing hallucination. Extensive experiments demonstrate that BlindHarmonyDiff outperforms prior arts by harmonizing diverse source domain images to the target domain, achieving higher correspondence to the target domain characteristics. Downstream task-based quality assessments such as tissue segmentation and age prediction on diverse MR scanners further confirm the effectiveness of our approach and demonstrate the capability of our robust and generalizable blind harmonization.
△ Less
Submitted 30 April, 2025;
originally announced May 2025.
-
An Addendum to NeBula: Towards Extending TEAM CoSTAR's Solution to Larger Scale Environments
Authors:
Ali Agha,
Kyohei Otsu,
Benjamin Morrell,
David D. Fan,
Sung-Kyun Kim,
Muhammad Fadhil Ginting,
Xianmei Lei,
Jeffrey Edlund,
Seyed Fakoorian,
Amanda Bouman,
Fernando Chavez,
Taeyeon Kim,
Gustavo J. Correa,
Maira Saboia,
Angel Santamaria-Navarro,
Brett Lopez,
Boseong Kim,
Chanyoung Jung,
Mamoru Sobue,
Oriana Claudia Peltzer,
Joshua Ott,
Robert Trybula,
Thomas Touma,
Marcel Kaufmann,
Tiago Stegun Vaquero
, et al. (64 additional authors not shown)
Abstract:
This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithm…
▽ More
This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithmic perspective, we discuss the following extensions to the original NeBula framework: (i) large-scale geometric and semantic environment mapping; (ii) an adaptive positioning system; (iii) probabilistic traversability analysis and local planning; (iv) large-scale POMDP-based global motion planning and exploration behavior; (v) large-scale networking and decentralized reasoning; (vi) communication-aware mission planning; and (vii) multi-modal ground-aerial exploration solutions. We demonstrate the application and deployment of the presented systems and solutions in various large-scale underground environments, including limestone mine exploration scenarios as well as deployment in the DARPA Subterranean challenge.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Documentation on Encrypted Dynamic Control Simulation Code using Ring-LWE based Cryptosystems
Authors:
Yeongjun Jang,
Joowon Lee,
Junsoo Kim
Abstract:
Encrypted controllers offer secure computation by employing modern cryptosystems to execute control operations directly over encrypted data without decryption. However, incorporating cryptosystems into dynamic controllers significantly increases the computational load. This paper aims to provide an accessible guideline for running encrypted controllers using an open-source library Lattigo, which s…
▽ More
Encrypted controllers offer secure computation by employing modern cryptosystems to execute control operations directly over encrypted data without decryption. However, incorporating cryptosystems into dynamic controllers significantly increases the computational load. This paper aims to provide an accessible guideline for running encrypted controllers using an open-source library Lattigo, which supports an efficient implementation of Ring-Learing With Errors (LWE) based encrypted controllers, and our explanations are assisted with example codes that are fully available at https://github.com/CDSL-EncryptedControl/CDSL.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Hang Guo,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yawei Li,
Yao Zhang,
Xinning Chai,
Zhengxue Cheng,
Yingsheng Qin,
Yucai Yang,
Li Song,
Hongyuan Yu,
Pufan Xu,
Cheng Wan,
Zhijuan Huang,
Peng Guo,
Shuyuan Cui,
Chenjun Li,
Xuehai Hu,
Pan Pan,
Xin Zhang,
Heng Zhang,
Qing Luo,
Linyan Jiang
, et al. (122 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…
▽ More
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Self-Supervised Autoencoder Network for Robust Heart Rate Extraction from Noisy Photoplethysmogram: Applying Blind Source Separation to Biosignal Analysis
Authors:
Matthew B. Webster,
Dongheon Lee,
Joonnyong Lee
Abstract:
Biosignals can be viewed as mixtures measuring particular physiological events, and blind source separation (BSS) aims to extract underlying source signals from mixtures. This paper proposes a self-supervised multi-encoder autoencoder (MEAE) to separate heartbeat-related source signals from photoplethysmogram (PPG), enhancing heart rate (HR) detection in noisy PPG data. The MEAE is trained on PPG…
▽ More
Biosignals can be viewed as mixtures measuring particular physiological events, and blind source separation (BSS) aims to extract underlying source signals from mixtures. This paper proposes a self-supervised multi-encoder autoencoder (MEAE) to separate heartbeat-related source signals from photoplethysmogram (PPG), enhancing heart rate (HR) detection in noisy PPG data. The MEAE is trained on PPG signals from a large open polysomnography database without any pre-processing or data selection. The trained network is then applied to a noisy PPG dataset collected during the daily activities of nine subjects. The extracted heartbeat-related source signal significantly improves HR detection as compared to the original PPG. The absence of pre-processing and the self-supervised nature of the proposed method, combined with its strong performance, highlight the potential of MEAE for BSS in biosignal analysis.
△ Less
Submitted 4 June, 2025; v1 submitted 12 April, 2025;
originally announced April 2025.
-
Performance Analysis of Deep Learning Models for Femur Segmentation in MRI Scan
Authors:
Mengyuan Liu,
Yixiao Chen,
Anning Tian,
Xinmeng Wu,
Mozhi Shen,
Tianchou Gong,
Jeongkyu Lee
Abstract:
Convolutional neural networks like U-Net excel in medical image segmentation, while attention mechanisms and KAN enhance feature extraction. Meta's SAM 2 uses Vision Transformers for prompt-based segmentation without fine-tuning. However, biases in these models impact generalization with limited data. In this study, we systematically evaluate and compare the performance of three CNN-based models,…
▽ More
Convolutional neural networks like U-Net excel in medical image segmentation, while attention mechanisms and KAN enhance feature extraction. Meta's SAM 2 uses Vision Transformers for prompt-based segmentation without fine-tuning. However, biases in these models impact generalization with limited data. In this study, we systematically evaluate and compare the performance of three CNN-based models, i.e., U-Net, Attention U-Net, and U-KAN, and one transformer-based model, i.e., SAM 2 for segmenting femur bone structures in MRI scan. The dataset comprises 11,164 MRI scans with detailed annotations of femoral regions. Performance is assessed using the Dice Similarity Coefficient, which ranges from 0.932 to 0.954. Attention U-Net achieves the highest overall scores, while U-KAN demonstrated superior performance in anatomical regions with a smaller region of interest, leveraging its enhanced learning capacity to improve segmentation accuracy.
△ Less
Submitted 5 April, 2025;
originally announced April 2025.
-
A Robust Method for Fault Detection and Severity Estimation in Mechanical Vibration Data
Authors:
Youngjae Jeon,
Eunho Heo,
Jinmo Lee,
Taewon Uhm,
Dongjin Lee
Abstract:
This paper proposes a robust method for fault detection and severity estimation in multivariate time-series data to enhance predictive maintenance of mechanical systems. We use the Temporal Graph Convolutional Network (T-GCN) model to capture both spatial and temporal dependencies among variables. This enables accurate future state predictions under varying operational conditions. To address the c…
▽ More
This paper proposes a robust method for fault detection and severity estimation in multivariate time-series data to enhance predictive maintenance of mechanical systems. We use the Temporal Graph Convolutional Network (T-GCN) model to capture both spatial and temporal dependencies among variables. This enables accurate future state predictions under varying operational conditions. To address the challenge of fluctuating anomaly scores that reduce fault severity estimation accuracy, we introduce a novel fault severity index based on the mean and standard deviation of anomaly scores. This generates a continuous and reliable severity measurement. We validate the proposed method using two experimental datasets: an open IMS bearing dataset and data collected from a fanjet electric propulsion system. Results demonstrate that our method significantly reduces abrupt fluctuations and inconsistencies in anomaly scores. This provides a more dependable foundation for maintenance planning and risk management in safety-critical applications.
△ Less
Submitted 4 April, 2025;
originally announced April 2025.
-
Egocentric Conformal Prediction for Safe and Efficient Navigation in Dynamic Cluttered Environments
Authors:
Jaeuk Shin,
Jungjin Lee,
Insoon Yang
Abstract:
Conformal prediction (CP) has emerged as a powerful tool in robotics and control, thanks to its ability to calibrate complex, data-driven models with formal guarantees. However, in robot navigation tasks, existing CP-based methods often decouple prediction from control, evaluating models without considering whether prediction errors actually compromise safety. Consequently, ego-vehicles may become…
▽ More
Conformal prediction (CP) has emerged as a powerful tool in robotics and control, thanks to its ability to calibrate complex, data-driven models with formal guarantees. However, in robot navigation tasks, existing CP-based methods often decouple prediction from control, evaluating models without considering whether prediction errors actually compromise safety. Consequently, ego-vehicles may become overly conservative or even immobilized when all potential trajectories appear infeasible. To address this issue, we propose a novel CP-based navigation framework that responds exclusively to safety-critical prediction errors. Our approach introduces egocentric score functions that quantify how much closer obstacles are to a candidate vehicle position than anticipated. These score functions are then integrated into a model predictive control scheme, wherein each candidate state is individually evaluated for safety. Combined with an adaptive CP mechanism, our framework dynamically adjusts to changes in obstacle motion without resorting to unnecessary conservatism. Theoretical analyses indicate that our method outperforms existing CP-based approaches in terms of cost-efficiency while maintaining the desired safety levels, as further validated through experiments on real-world datasets featuring densely populated pedestrian environments.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
Performance analysis of metasurface-based spatial multimode transmission for 6G wireless communications
Authors:
Ju Yong Lee,
Seung-Won Keum,
Sang Min Oh,
Dang-Oh Kim,
Dong-Ho Cho
Abstract:
In 6th generation wireless communication technology, it is important to utilize space resources efficiently. Recently, holographic multiple-input multiple-output (HMIMO) and meta-surface technology have attracted attention as technologies that maximize space utilization for 6G mobile communications. However, studies on HMIMO communications are still in an initial stage and its fundamental limits a…
▽ More
In 6th generation wireless communication technology, it is important to utilize space resources efficiently. Recently, holographic multiple-input multiple-output (HMIMO) and meta-surface technology have attracted attention as technologies that maximize space utilization for 6G mobile communications. However, studies on HMIMO communications are still in an initial stage and its fundamental limits are yet to be unveiled. It is well known that the Fourier transform relationship can be obtained using a lens in the optical field, but research to apply it to the mobile communication field is in the early stages. In this paper, we show that the Fourier transform relationship between signals can be obtained when two metasurfaces are aligned or unaligned, and analyze the transmission and reception power, and the maximum number of spatial multimodes that can be transmitted. In addition, to reduce transmission complexity, we propose a spatial multimode transmission system using three metasurfaces and analyze signal characteristics on the meta-surfaces. In numerical results, we provide the performance of spatial multimode transmission in case of using rectangular and Gaussian signals.
△ Less
Submitted 31 March, 2025;
originally announced April 2025.
-
Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
Authors:
Hyunjong Ok,
Suho Yoo,
Jaeho Lee
Abstract:
Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) -- the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken…
▽ More
Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) -- the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects the non-speaking units in real-time on local devices, and a high-performance Wav2vec-based model running on the server to make a more challenging classification of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low. Datasets and code will be available after the review.
△ Less
Submitted 30 March, 2025;
originally announced March 2025.
-
SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System
Authors:
Hyeongju Kim,
Jinhyeok Yang,
Yechan Yu,
Seunghun Ji,
Jacob Morton,
Frederik Bous,
Joon Byun,
Juheon Lee
Abstract:
We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ…
▽ More
We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: https://supertonictts.github.io/.
△ Less
Submitted 16 May, 2025; v1 submitted 29 March, 2025;
originally announced March 2025.
-
Multispectral Demosaicing via Dual Cameras
Authors:
SaiKiran Tedla,
Junyong Lee,
Beixuan Yang,
Mahmoud Afifi,
Michael S. Brown
Abstract:
Multispectral (MS) images capture detailed scene information across a wide range of spectral bands, making them invaluable for applications requiring rich spectral data. Integrating MS imaging into multi camera devices, such as smartphones, has the potential to enhance both spectral applications and RGB image quality. A critical step in processing MS data is demosaicing, which reconstructs color i…
▽ More
Multispectral (MS) images capture detailed scene information across a wide range of spectral bands, making them invaluable for applications requiring rich spectral data. Integrating MS imaging into multi camera devices, such as smartphones, has the potential to enhance both spectral applications and RGB image quality. A critical step in processing MS data is demosaicing, which reconstructs color information from the mosaic MS images captured by the camera. This paper proposes a method for MS image demosaicing specifically designed for dual-camera setups where both RGB and MS cameras capture the same scene. Our approach leverages co-captured RGB images, which typically have higher spatial fidelity, to guide the demosaicing of lower-fidelity MS images. We introduce the Dual-camera RGB-MS Dataset - a large collection of paired RGB and MS mosaiced images with ground-truth demosaiced outputs - that enables training and evaluation of our method. Experimental results demonstrate that our method achieves state-of-the-art accuracy compared to existing techniques.
△ Less
Submitted 8 April, 2025; v1 submitted 27 March, 2025;
originally announced March 2025.
-
Validation and Calibration of Energy Models with Real Vehicle Data from Chassis Dynamometer Experiments
Authors:
Joy Carpio,
Sulaiman Almatrudi,
Nour Khoudari,
Zhe Fu,
Kenneth Butts,
Jonathan Lee,
Benjamin Seibold,
Alexandre Bayen
Abstract:
Accurate estimation of vehicle fuel consumption typically requires detailed modeling of complex internal powertrain dynamics, often resulting in computationally intensive simulations. However, many transportation applications-such as traffic flow modeling, optimization, and control-require simplified models that are fast, interpretable, and easy to implement, while still maintaining fidelity to ph…
▽ More
Accurate estimation of vehicle fuel consumption typically requires detailed modeling of complex internal powertrain dynamics, often resulting in computationally intensive simulations. However, many transportation applications-such as traffic flow modeling, optimization, and control-require simplified models that are fast, interpretable, and easy to implement, while still maintaining fidelity to physical energy behavior. This work builds upon a recently developed model reduction pipeline that derives physics-like energy models from high-fidelity Autonomie vehicle simulations. These reduced models preserve essential vehicle dynamics, enabling realistic fuel consumption estimation with minimal computational overhead. While the reduced models have demonstrated strong agreement with their Autonomie counterparts, previous validation efforts have been confined to simulation environments. This study extends the validation by comparing the reduced energy model's outputs against real-world vehicle data. Focusing on the MidSUV category, we tune the baseline Autonomie model to closely replicate the characteristics of a Toyota RAV4. We then assess the accuracy of the resulting reduced model in estimating fuel consumption under actual drive conditions. Our findings suggest that, when the reference Autonomie model is properly calibrated, the simplified model produced by the reduction pipeline can provide reliable, semi-principled fuel rate estimates suitable for large-scale transportation applications.
△ Less
Submitted 26 March, 2025;
originally announced March 2025.
-
Bridging the Sim-to-real Gap: A Control Framework for Imitation Learning of Model Predictive Control
Authors:
Seungtaek Kim,
Jonghyup Lee,
Kyoungseok Han,
Seibum B. Choi
Abstract:
To address the computational challenges of Model Predictive Control (MPC), recent research has studied using imitation learning to approximate the MPC to a computationally efficient Deep Neural Network (DNN). However, this introduces a common issue in learning-based control, the simulation-to-reality (sim-to-real) gap, and Domain Randomization (DR) has been widely used to mitigate this gap by intr…
▽ More
To address the computational challenges of Model Predictive Control (MPC), recent research has studied using imitation learning to approximate the MPC to a computationally efficient Deep Neural Network (DNN). However, this introduces a common issue in learning-based control, the simulation-to-reality (sim-to-real) gap, and Domain Randomization (DR) has been widely used to mitigate this gap by introducing perturbations in the source domain. However, DR inevitably leads to low data collection efficiency and an overly conservative control strategy. This study proposes a new control framework that addresses this issue from a control perspective inspired by Robust Tube MPC. The framework ensures the DNN operates in the same environment as the source domain, handling the sim-to-real gap with great data collection efficiency. Moreover, a parameter governor is introduced to address the DNN's inability to adapt to variations in model parameters, enabling the system to satisfy MPC constraints more robustly under changing conditions. The proposed framework was validated through two case studies: cart-pole control and vehicle collision avoidance control, which analyzed the principles of the proposed framework in detail and demonstrated its application to a vehicle control case.
△ Less
Submitted 3 July, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
Authors:
Suho Yoo,
Hyunjong Ok,
Jaeho Lee
Abstract:
Language models pretrained on text-only corpora often struggle with tasks that require auditory commonsense knowledge. Previous work addresses this problem by augmenting the language model to retrieve knowledge from external audio databases. This approach has several limitations, such as the potential lack of relevant audio in databases and the high costs associated with constructing the databases…
▽ More
Language models pretrained on text-only corpora often struggle with tasks that require auditory commonsense knowledge. Previous work addresses this problem by augmenting the language model to retrieve knowledge from external audio databases. This approach has several limitations, such as the potential lack of relevant audio in databases and the high costs associated with constructing the databases. To address these issues, we propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. We develop several mechanisms to efficiently process multiple auditory knowledge, including a CLAP-based rejection sampler and a language-audio fusion module. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases, highlighting the effectiveness of our generation-based approach.
△ Less
Submitted 8 June, 2025; v1 submitted 21 March, 2025;
originally announced March 2025.
-
Revival: Collaborative Artistic Creation through Human-AI Interactions in Musical Creativity
Authors:
Keon Ju M. Lee,
Philippe Pasquier,
Jun Yuri
Abstract:
Revival is an innovative live audiovisual performance and music improvisation by our artist collective K-Phi-A, blending human and AI musicianship to create electronic music with audio-reactive visuals. The performance features real-time co-creative improvisation between a percussionist, an electronic music artist, and AI musical agents. Trained in works by deceased composers and the collective's…
▽ More
Revival is an innovative live audiovisual performance and music improvisation by our artist collective K-Phi-A, blending human and AI musicianship to create electronic music with audio-reactive visuals. The performance features real-time co-creative improvisation between a percussionist, an electronic music artist, and AI musical agents. Trained in works by deceased composers and the collective's compositions, these agents dynamically respond to human input and emulate complex musical styles. An AI-driven visual synthesizer, guided by a human VJ, produces visuals that evolve with the musical landscape. Revival showcases the potential of AI and human collaboration in improvisational artistic creation.
△ Less
Submitted 19 January, 2025;
originally announced March 2025.
-
PD-Skygroundhook Controller for Semi-Active Suspension System Using Magnetorheological Fluid Dampers
Authors:
Hansol Lim,
Jee Won Lee,
Seung-Bok Choi,
Jongseong Brad Choi
Abstract:
This paper presents a Proportional-Derivative (PD) Skygroundhook controller for magnetorheological (MR) dampers in semi-active suspensions. Traditional skyhook, Groundhook, and hybrid Skygroundhook controllers are well-known for their ability to reduce body and wheel vibrations; however, each approach has limitations in handling a broad frequency spectrum and often relies on abrupt switching. By a…
▽ More
This paper presents a Proportional-Derivative (PD) Skygroundhook controller for magnetorheological (MR) dampers in semi-active suspensions. Traditional skyhook, Groundhook, and hybrid Skygroundhook controllers are well-known for their ability to reduce body and wheel vibrations; however, each approach has limitations in handling a broad frequency spectrum and often relies on abrupt switching. By adding a derivative action to the classical Skygroundhook logic, the proposed PD-Skygroundhook method enhances high-frequency damping and stabilizes transition behaviors. By leveraging the fast response of MR dampers, our controller adjusts the damper force continuously in real time to match the desired damping force of PD-Skygroundhook controller with efficient computation. Experimental evaluations under bump excitations and sine-sweeping tests demonstrate a significant reduction in sprung mass acceleration and unsprung mass acceleration, outperforming standard Skygroundhook in both ride comfort and road handling. These results highlight that the derivative action effectively reduces resonance peaks and smooths out force transitions of regular Skygroundhook. Our method offers a robust alternative to more computationally demanding semi-active controllers.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
ValSub: Subsampling Validation Data to Mitigate Forgetting during ASR Personalization
Authors:
Haaris Mehmood,
Karthikeyan Saravanan,
Pablo Peso Parada,
David Tuckey,
Mete Ozay,
Gil Ho Lee,
Jungin Lee,
Seokyeong Jung
Abstract:
Automatic Speech Recognition (ASR) is widely used within consumer devices such as mobile phones. Recently, personalization or on-device model fine-tuning has shown that adaptation of ASR models towards target user speech improves their performance over rare words or accented speech. Despite these gains, fine-tuning on user data (target domain) risks the personalized model to forget knowledge about…
▽ More
Automatic Speech Recognition (ASR) is widely used within consumer devices such as mobile phones. Recently, personalization or on-device model fine-tuning has shown that adaptation of ASR models towards target user speech improves their performance over rare words or accented speech. Despite these gains, fine-tuning on user data (target domain) risks the personalized model to forget knowledge about its original training distribution (source domain) i.e. catastrophic forgetting, leading to subpar general ASR performance. A simple and efficient approach to combat catastrophic forgetting is to measure forgetting via a validation set that represents the source domain distribution. However, such validation sets are large and impractical for mobile devices. Towards this, we propose a novel method to subsample a substantially large validation set into a smaller one while maintaining the ability to estimate forgetting. We demonstrate the efficacy of such a dataset in mitigating forgetting by utilizing it to dynamically determine the number of ideal fine-tuning epochs. When measuring the deviations in per user fine-tuning epochs against a 50x larger validation set (oracle), our method achieves a lower mean-absolute-error (3.39) compared to randomly selected subsets of the same size (3.78-8.65). Unlike random baselines, our method consistently tracks the oracle's behaviour across three different forgetting thresholds.
△ Less
Submitted 7 April, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
GlaGAN: A Generative Unsupervised Model for High-Precision Segmentation of Retinal Main Vessels toward Early Detection of Glaucoma
Authors:
Cheng Huang,
Weizheng Xie,
Tsengdar J. Lee,
Jui-Kai Wang,
Karanjit Kooner,
Ning Zhang,
Jia Zhang
Abstract:
Structural changes in the main retinal blood vessels are critical biomarkers for glaucoma onset and progression. Identifying these vessels is essential for vascular modeling yet highly challenging. This paper introduces GlaGAN, an unsupervised generative AI model for segmenting main blood vessels in Optical Coherence Tomography Angiography (OCTA) images. The process begins with the Space Colonizat…
▽ More
Structural changes in the main retinal blood vessels are critical biomarkers for glaucoma onset and progression. Identifying these vessels is essential for vascular modeling yet highly challenging. This paper introduces GlaGAN, an unsupervised generative AI model for segmenting main blood vessels in Optical Coherence Tomography Angiography (OCTA) images. The process begins with the Space Colonization Algorithm (SCA) to rapidly generate vessel skeletons, including radius estimations. By synergistically integrating generative adversarial networks (GANs) with biostatistical modeling of vessel radii, GlaGAN efficiently reconstructs 2D and 3D representations, achieving nearly 100\% segmentation accuracy without requiring labeled data or high-performance computing resources. To address data scarcity, we also present GSS-RetVein, a high-definition mixed 2D/3D glaucoma retinal dataset featuring clear capillary structures. Designed for robustness testing, GSS-RetVein incorporates controlled noise while maintaining sharp capillary boundaries in 2D and enhancing 3D vascular reconstruction for blood flow prediction and glaucoma progression simulations. Experimental results demonstrate GSS-RetVein outperforms existing datasets in evaluating main vessel segmentation. Code and dataset are available: https://github.com/VikiXie/SatMar8.
△ Less
Submitted 7 July, 2025; v1 submitted 9 March, 2025;
originally announced March 2025.
-
Rethinking Static Line Rating for Economic and Efficient Power Operation in South Korea
Authors:
Junseon Park,
Junhyun Lee,
Hyeongon Park
Abstract:
In South Korea, power grid is currently operated based on the static line rating (SLR) method, where the transmission
line capacity is determined based on extreme weather conditions. However, with global warming, there is a concern
that the temperatures during summer may exceed the SLR criteria, posing safety risks. On the other hand, the conservative estimates used for winter conditions limit…
▽ More
In South Korea, power grid is currently operated based on the static line rating (SLR) method, where the transmission
line capacity is determined based on extreme weather conditions. However, with global warming, there is a concern
that the temperatures during summer may exceed the SLR criteria, posing safety risks. On the other hand, the conservative estimates used for winter conditions limit the utilization of renewable energy. Proposals to install new lines face
significant financial and environmental hurdles, complicating efforts to adapt to these changing conditions. Dynamic
Line Rating (DLR) offers a real-time solution but requires extensive weather monitoring and complex integration. This
paper proposes a novel method that improves on SLR by analyzing historical data to refine line rating criteria on a
monthly, seasonal, and semi-annual basis. Through simulations, we show our approach significantly enhances cost effectiveness and reliability of the power system, achieving efficiencies close to DLR with existing infrastructure. This
method offers a practical alternative to overcome the limitations of SLR and the implementation challenges of DLR.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Acoustic Anomaly Detection on UAM Propeller Defect with Acoustic dataset for Crack of drone Propeller (ADCP)
Authors:
Juho Lee,
Donghyun Yoon,
Gumoon Jeong,
Hyeoncheol Kim
Abstract:
The imminent commercialization of UAM requires stable, AI-based maintenance systems to ensure safety for both passengers and pedestrians. This paper presents a methodology for non-destructively detecting cracks in UAM propellers using drone propeller sound datasets. Normal operating sounds were recorded, and abnormal sounds (categorized as ripped and broken) were differentiated by varying the micr…
▽ More
The imminent commercialization of UAM requires stable, AI-based maintenance systems to ensure safety for both passengers and pedestrians. This paper presents a methodology for non-destructively detecting cracks in UAM propellers using drone propeller sound datasets. Normal operating sounds were recorded, and abnormal sounds (categorized as ripped and broken) were differentiated by varying the microphone-propeller angle and throttle power. Our novel approach integrates FFT and STFT preprocessing techniques to capture both global frequency patterns and local time-frequency variations, thereby enhancing anomaly detection performance. The constructed Acoustic Dataset for Crack of Drone Propeller (ADCP) demonstrates the potential for detecting propeller cracks and lays the groundwork for future UAM maintenance applications.
△ Less
Submitted 2 March, 2025;
originally announced March 2025.
-
The GigaMIDI Dataset with Features for Expressive Music Performance Detection
Authors:
Keon Ju Maverick Lee,
Jeff Ens,
Sara Adkins,
Pedro Sarmento,
Mathieu Barthet,
Philippe Pasquier
Abstract:
The Musical Instrument Digital Interface (MIDI), introduced in 1983, revolutionized music production by allowing computers and instruments to communicate efficiently. MIDI files encode musical instructions compactly, facilitating convenient music sharing. They benefit Music Information Retrieval (MIR), aiding in research on music understanding, computational musicology, and generative music. The G…
▽ More
The Musical Instrument Digital Interface (MIDI), introduced in 1983, revolutionized music production by allowing computers and instruments to communicate efficiently. MIDI files encode musical instructions compactly, facilitating convenient music sharing. They benefit Music Information Retrieval (MIR), aiding in research on music understanding, computational musicology, and generative music. The GigaMIDI dataset contains over 1.4 million unique MIDI files, encompassing 1.8 billion MIDI note events and over 5.3 million MIDI tracks. GigaMIDI is currently the largest collection of symbolic music in MIDI format available for research purposes under fair dealing. Distinguishing between non-expressive and expressive MIDI tracks is challenging, as MIDI files do not inherently make this distinction. To address this issue, we introduce a set of innovative heuristics for detecting expressive music performance. These include the Distinctive Note Velocity Ratio (DNVR) heuristic, which analyzes MIDI note velocity; the Distinctive Note Onset Deviation Ratio (DNODR) heuristic, which examines deviations in note onset times; and the Note Onset Median Metric Level (NOMML) heuristic, which evaluates onset positions relative to metric levels. Our evaluation demonstrates these heuristics effectively differentiate between non-expressive and expressive MIDI tracks. Furthermore, after evaluation, we create the most substantial expressive MIDI dataset, employing our heuristic, NOMML. This curated iteration of GigaMIDI encompasses expressively-performed instrument tracks detected by NOMML, containing all General MIDI instruments, constituting 31% of the GigaMIDI dataset, totalling 1,655,649 tracks.
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
MC2SleepNet: Multi-modal Cross-masking with Contrastive Learning for Sleep Stage Classification
Authors:
Younghoon Na,
Hyun Keun Ahn,
Hyun-Kyung Lee,
Yoongeol Lee,
Seung Hun Oh,
Hongkwon Kim,
Jeong-Gun Lee
Abstract:
Sleep profoundly affects our health, and sleep deficiency or disorders can cause physical and mental problems. Despite significant findings from previous studies, challenges persist in optimizing deep learning models, especially in multi-modal learning for high-accuracy sleep stage classification. Our research introduces MC2SleepNet (Multi-modal Cross-masking with Contrastive learning for Sleep st…
▽ More
Sleep profoundly affects our health, and sleep deficiency or disorders can cause physical and mental problems. Despite significant findings from previous studies, challenges persist in optimizing deep learning models, especially in multi-modal learning for high-accuracy sleep stage classification. Our research introduces MC2SleepNet (Multi-modal Cross-masking with Contrastive learning for Sleep stage classification Network). It aims to facilitate the effective collaboration between Convolutional Neural Networks (CNNs) and Transformer architectures for multi-modal training with the help of contrastive learning and cross-masking. Raw single channel EEG signals and corresponding spectrogram data provide differently characterized modalities for multi-modal learning. Our MC2SleepNet has achieved state-of-the-art performance with an accuracy of both 84.6% on the SleepEDF-78 and 88.6% accuracy on the Sleep Heart Health Study (SHHS). These results demonstrate the effective generalization of our proposed network across both small and large datasets.
△ Less
Submitted 26 February, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation
Authors:
Yoonjin Chung,
Pilsun Eu,
Junwon Lee,
Keunwoo Choi,
Juhan Nam,
Ben Sangbae Chon
Abstract:
Although being widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Max…
▽ More
Although being widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
△ Less
Submitted 9 March, 2025; v1 submitted 21 February, 2025;
originally announced February 2025.
-
Blind Massive MIMO for Dense IoT Networks
Authors:
Jeongjae Lee,
Songnam Hong
Abstract:
In this paper, we investigate the downlink communication challenges in heavy-load Internet-of-Things (IoT) networks supported by frequency-division-duplexing (FDD) millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) systems. The excessive overhead required for obtaining channel state information at the transmitter (CSIT) is essential to achieve high spectral efficiency through c…
▽ More
In this paper, we investigate the downlink communication challenges in heavy-load Internet-of-Things (IoT) networks supported by frequency-division-duplexing (FDD) millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) systems. The excessive overhead required for obtaining channel state information at the transmitter (CSIT) is essential to achieve high spectral efficiency through conventional massive MIMO techniques; however, it hinders the deployment of ultra-reliable low-latency communications (URLLC) and leads to significant energy expenditure, particularly in dense IoT networks. To address this challenge, we propose an innovative CSIT-Free MIMO precoding method, referred to as CIRculant information Classification via Linear Estimation (CIRCLE). Our major contribution is the design of a CSIT-independent (or deterministic) precoding, which is constructed by leveraging the circulant permutation of the discrete Fourier transform (DFT) matrix. This design enables interference-free signal combining at the IoT devices. Through theoretical analysis and simulations, we verify the effectiveness of the proposed CIRCLE method.
△ Less
Submitted 15 February, 2025;
originally announced February 2025.