-
InfiniteAudio: Infinite-Length Audio Generation with Consistency
Authors:
Chaeyoung Jung,
Hojoon Ki,
Ji-Hoon Kim,
Junmo Kim,
Joon Son Chung
Abstract:
This paper presents InfiniteAudio, a simple yet effective strategy for generating infinite-length audio using diffusion-based text-to-audio methods. Current approaches face memory constraints because the output size increases with input length, making long duration generation challenging. A common workaround is to concatenate short audio segments, but this often leads to inconsistencies due to the lack of shared temporal context. To address this, InfiniteAudio integrates seamlessly into existing pipelines without additional training. It introduces two key techniques: FIFO sampling, a first-in, first-out inference strategy with fixed-size inputs, and curved denoising, which selectively prioritizes key diffusion steps for efficiency. Experiments show that InfiniteAudio achieves comparable or superior performance across all metrics. Audio samples are available on our project page.
Submitted 3 June, 2025;
originally announced June 2025.
-
Attention-Aided MMSE for OFDM Channel Estimation: Learning Linear Filters with Attention
Authors:
TaeJun Ha,
Chaehyun Jung,
Hyeonuk Kim,
Jeongwoo Park,
Jeonghun Park
Abstract:
In orthogonal frequency division multiplexing (OFDM), accurate channel estimation is crucial. Classical signal processing based approaches, such as minimum mean-squared error (MMSE) estimation, often require second-order statistics that are difficult to obtain in practice. Recent deep neural networks based methods have been introduced to address this; yet they often suffer from high complexity. This paper proposes an Attention-aided MMSE (A-MMSE), a novel model-based DNN framework that learns the optimal MMSE filter via the Attention Transformer. Once trained, the A-MMSE estimates the channel through a single linear operation for channel estimation, eliminating nonlinear activations during inference and thus reducing computational complexity. To enhance the learning efficiency of the A-MMSE, we develop a two-stage Attention encoder, designed to effectively capture the channel correlation structure. Additionally, a rank-adaptive extension of the proposed A-MMSE allows flexible trade-offs between complexity and channel estimation accuracy. Extensive simulations with 3GPP TDL channel models demonstrate that the proposed A-MMSE consistently outperforms other baseline methods in terms of normalized MSE across a wide range of SNR conditions. In particular, the A-MMSE and its rank-adaptive extension establish a new frontier in the performance complexity trade-off, redefining the standard for practical channel estimation methods.
Submitted 31 May, 2025;
originally announced June 2025.
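For context on the classical baseline named in the A-MMSE abstract above, here is a minimal sketch of the textbook linear MMSE channel estimator. The observation model (unit pilots on every subcarrier, zero-mean channel independent of white noise) and the symbols y, h, n, R_h, and σ² are assumptions made for illustration; the abstract states none of them, and this is not the A-MMSE method itself.

```latex
% Assumed observation model: least-squares pilot observation of the channel
\mathbf{y} = \mathbf{h} + \mathbf{n}, \qquad
\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}), \qquad
\mathbf{R}_h = \mathbb{E}\!\left[\mathbf{h}\mathbf{h}^{\mathsf{H}}\right]

% Linear MMSE estimate: a single matrix-vector product at inference time
\hat{\mathbf{h}}_{\mathrm{MMSE}} = \mathbf{W}\,\mathbf{y}, \qquad
\mathbf{W} = \mathbf{R}_h \left(\mathbf{R}_h + \sigma^2 \mathbf{I}\right)^{-1}
```

Under these assumptions, the filter W depends only on the second-order statistics R_h and σ², which are the quantities the abstract describes as difficult to obtain in practice.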
-
SEED: Speaker Embedding Enhancement Diffusion Model
Authors:
KiHyun Nam,
Jungwoo Heo,
Jee-weon Jung,
Gangin Park,
Chaeyoung Jung,
Ha-Jin Yu,
Joon Son Chung
Abstract:
A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings extracted from clean and noisy speech, respectively, via the forward process of a diffusion model, and then reconstructs them to clean embeddings in the reverse process. During inference, all embeddings are regenerated via the diffusion process. Our method needs neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environment mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios. Our code is available at https://github.com/kaistmm/seed-pytorch
Submitted 22 May, 2025;
originally announced May 2025.
-
An Addendum to NeBula: Towards Extending TEAM CoSTAR's Solution to Larger Scale Environments
Authors:
Ali Agha,
Kyohei Otsu,
Benjamin Morrell,
David D. Fan,
Sung-Kyun Kim,
Muhammad Fadhil Ginting,
Xianmei Lei,
Jeffrey Edlund,
Seyed Fakoorian,
Amanda Bouman,
Fernando Chavez,
Taeyeon Kim,
Gustavo J. Correa,
Maira Saboia,
Angel Santamaria-Navarro,
Brett Lopez,
Boseong Kim,
Chanyoung Jung,
Mamoru Sobue,
Oriana Claudia Peltzer,
Joshua Ott,
Robert Trybula,
Thomas Touma,
Marcel Kaufmann,
Tiago Stegun Vaquero
, et al. (64 additional authors not shown)
Abstract:
This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithmic perspective, we discuss the following extensions to the original NeBula framework: (i) large-scale geometric and semantic environment mapping; (ii) an adaptive positioning system; (iii) probabilistic traversability analysis and local planning; (iv) large-scale POMDP-based global motion planning and exploration behavior; (v) large-scale networking and decentralized reasoning; (vi) communication-aware mission planning; and (vii) multi-modal ground-aerial exploration solutions. We demonstrate the application and deployment of the presented systems and solutions in various large-scale underground environments, including limestone mine exploration scenarios as well as deployment in the DARPA Subterranean challenge.
Submitted 18 April, 2025;
originally announced April 2025.
-
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Authors:
Ji-Hoon Kim,
Jeongsoo Choi,
Jaehun Kim,
Chaeyoung Jung,
Joon Son Chung
Abstract:
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning of hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure the seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
Submitted 21 March, 2025;
originally announced March 2025.
-
Multidimensional Swarm Flight Approach For Chasing Unauthorized UAVs Leveraging Asynchronous Deep Learning
Authors:
Tae-Won Ban,
Kyu-Min Kang,
Bang Chul Jung
Abstract:
This paper introduces a novel unmanned aerial vehicles (UAV) chasing system designed to track and chase unauthorized UAVs, significantly enhancing their neutralization effectiveness.
Submitted 4 February, 2025;
originally announced February 2025.
-
Cracks in concrete
Authors:
Tin Barisin,
Christian Jung,
Anna Nowacka,
Claudia Redenbach,
Katja Schladitz
Abstract:
Finding and properly segmenting cracks in images of concrete is a challenging task. Cracks are thin and rough and being air filled do yield a very weak contrast in 3D images obtained by computed tomography. Enhancing and segmenting dark lower-dimensional structures is already demanding. The heterogeneous concrete matrix and the size of the images further increase the complexity. ML methods have proven to solve difficult segmentation problems when trained on enough and well annotated data. However, so far, there is not much 3D image data of cracks available at all, let alone annotated. Interactive annotation is error-prone as humans can easily tell cats from dogs or roads without from roads with cars but have a hard time deciding whether a thin and dark structure seen in a 2D slice continues in the next one. Training networks by synthetic, simulated images is an elegant way out, bears however its own challenges. In this contribution, we describe how to generate semi-synthetic image data to train CNN like the well known 3D U-Net or random forests for segmenting cracks in 3D images of concrete. The thickness of real cracks varies widely, both, within one crack as well as from crack to crack in the same sample. The segmentation method should therefore be invariant with respect to scale changes. We introduce the so-called RieszNet, designed for exactly this purpose. Finally, we discuss how to generalize the ML crack segmentation methods to other concrete types.
Submitted 30 January, 2025;
originally announced January 2025.
-
VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
Authors:
Jaemin Jung,
Junseok Ahn,
Chaeyoung Jung,
Tan Dat Nguyen,
Youngjoon Jang,
Joon Son Chung
Abstract:
We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.
Submitted 26 December, 2024;
originally announced December 2024.
-
Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation
Authors:
Junhyeok Lee,
Yujin Oh,
Dahyoun Lee,
Hyon Keun Joh,
Chul-Ho Sohn,
Sung Hyun Baik,
Cheol Kyu Jung,
Jung Hyun Park,
Kyu Sung Choi,
Byung-Hoon Kim,
Jong Chul Ye
Abstract:
Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretative AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. By exploiting the retrieved radiology reports to augment the report generation process of the query image, we show by experiments with extensive in-house and public datasets that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.
Submitted 23 November, 2024;
originally announced November 2024.
-
Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices
Authors:
Jeho Lee,
Chanyoung Jung,
Jiwon Kim,
Hojung Cha
Abstract:
3D object detection with omnidirectional views enables safety-critical applications such as mobile robot navigation. Such applications increasingly operate on resource-constrained edge devices, facilitating reliable processing without privacy concerns or network delays. To enable cost-effective deployment, cameras have been widely adopted as a low-cost alternative to LiDAR sensors. However, the compute-intensive workload to achieve high performance of camera-based solutions remains challenging due to the computational limitations of edge devices. In this paper, we present Panopticus, a carefully designed system for omnidirectional and camera-based 3D detection on edge devices. Panopticus employs an adaptive multi-branch detection scheme that accounts for spatial complexities. To optimize the accuracy within latency limits, Panopticus dynamically adjusts the model's architecture and operations based on available edge resources and spatial characteristics. We implemented Panopticus on three edge devices and conducted experiments across real-world environments based on the public self-driving dataset and our mobile 360° camera dataset. Experiment results showed that Panopticus improves accuracy by 62% on average given the strict latency objective of 33 ms. Also, Panopticus achieves a 2.1× latency reduction on average compared to baselines.
Submitted 2 October, 2024;
originally announced October 2024.
-
Personalised Outfit Recommendation via History-aware Transformers
Authors:
Myong Chol Jung,
Julien Monteil,
Philip Schulz,
Volodymyr Vaskovych
Abstract:
We present the history-aware transformer (HAT), a transformer-based model that uses shoppers' purchase history to personalise outfit predictions. The aim of this work is to recommend outfits that are internally coherent while matching an individual shopper's style and taste. To achieve this, we stack two transformer models, one that produces outfit representations and another one that processes the history of purchased outfits for a given shopper. We use these models to score an outfit's compatibility in the context of a shopper's preferences as inferred from their previous purchases. During training, the model learns to discriminate between purchased and random outfits using 3 losses: the focal loss for outfit compatibility typically used in the literature, a contrastive loss to bring closer learned outfit embeddings from a shopper's history, and an adaptive margin loss to facilitate learning from weak negatives. Together, these losses enable the model to make personalised recommendations based on a shopper's purchase history.
Our experiments on the IQON3000 and Polyvore datasets show that HAT outperforms strong baselines on the outfit Compatibility Prediction (CP) and the Fill In The Blank (FITB) tasks. The model improves AUC for the CP hard task by 15.7% (IQON3000) and 19.4% (Polyvore) compared to previous SOTA results. It further improves accuracy on the FITB hard task by 6.5% and 9.7%, respectively. We provide ablation studies on the personalisation, contrastive loss, and adaptive margin loss that highlight the importance of these modelling choices.
Submitted 26 September, 2024; v1 submitted 28 June, 2024;
originally announced July 2024.
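The HAT abstract above names a focal loss for outfit compatibility. As a rough illustration of that loss family only, the sketch below implements the standard binary focal loss for a single prediction. The function name `focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are assumptions (commonly used values), not settings taken from the paper, and the contrastive and adaptive margin losses are not shown.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one prediction (illustrative, not HAT's exact loss).

    p     -- predicted probability that the outfit is compatible (positive class)
    y     -- ground-truth label: 1 (e.g. purchased outfit) or 0 (e.g. random outfit)
    gamma -- focusing parameter; larger values down-weight easy examples (assumed default)
    alpha -- class-balancing weight for the positive class (assumed default)
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # weight for the true class
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # A confident correct prediction contributes far less than an uncertain one.
    print(focal_loss(0.9, 1), focal_loss(0.6, 1))
```

With `gamma=0.0` the expression reduces to an alpha-weighted cross-entropy, which is a convenient sanity check when experimenting with the focusing parameter.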
-
FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
Authors:
Chaeyoung Jung,
Suyeon Lee,
Ji-Hoon Kim,
Joon Son Chung
Abstract:
This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE which enhances the inference speed and reduces the number of learnable parameters without degrading the output quality. In particular, we employ a conditional flow matching algorithm that enables the generation of high-quality speech in a single sampling step. Moreover, we increase efficiency by optimizing the underlying U-net architecture of diffusion-based systems. Our experiments demonstrate that FlowAVSE achieves 22 times faster inference speed and reduces the model size by half while maintaining the output quality. The demo page is available at: https://cyongong.github.io/FlowAVSE.github.io/
Submitted 13 June, 2024;
originally announced June 2024.
-
Enhancing Battlefield Awareness: An Aerial RIS-assisted ISAC System with Deep Reinforcement Learning
Authors:
Hyunsang Cho,
Seonghoon Yoo,
Bang Chul Jung,
Joonhyuk Kang
Abstract:
This paper considers a joint communication and sensing technique for enhancing situational awareness in practical battlefield scenarios. In particular, we propose an aerial reconfigurable intelligent surface (ARIS)-assisted integrated sensing and communication (ISAC) system consisting of a single access point (AP), an ARIS, multiple users, and a sensing target. With deep reinforcement learning (DRL), we jointly optimize the transmit beamforming of the AP, the RIS phase shifts, and the trajectory of the ARIS under signal-to-interference-noise ratio (SINR) constraints. Numerical results demonstrate that the proposed technique outperforms the conventional benchmark schemes by suppressing the self-interference and clutter echo signals or optimizing the RIS phase shifts.
Submitted 30 May, 2024;
originally announced May 2024.
-
Model Predictive Guidance for Fuel-Optimal Landing of Reusable Launch Vehicles
Authors:
Ki-Wook Jung,
Sang-Don Lee,
Cheol-Goo Jung,
Chang-Hun Lee
Abstract:
This paper introduces a landing guidance strategy for reusable launch vehicles (RLVs) using a model predictive approach based on sequential convex programming (SCP). The proposed approach devises two distinct optimal control problems (OCPs): planning a fuel-optimal landing trajectory that accommodates practical path constraints specific to RLVs, and determining real-time optimal tracking commands. This dual optimization strategy allows for reduced computational load through adjustable prediction horizon lengths in the tracking task, achieving near closed-loop performance. Enhancements in model fidelity for the tracking task are achieved through an alternative rotational dynamics representation, enabling a more stable numerical solution of the OCP and accounting for vehicle transient dynamics. Furthermore, modifications of aerodynamic force in both planning and tracking phases are proposed, tailored for thrust-vector-controlled RLVs, to reduce the fidelity gap without adding computational complexity. Extensive 6-DOF simulation experiments validate the effectiveness and improved guidance performance of the proposed algorithm.
Submitted 2 May, 2024;
originally announced May 2024.
-
WaveDH: Wavelet Sub-bands Guided ConvNet for Efficient Image Dehazing
Authors:
Seongmin Hwang,
Daeyoung Han,
Cheolkon Jung,
Moongu Jeon
Abstract:
The surge in interest regarding image dehazing has led to notable advancements in deep learning-based single image dehazing approaches, exhibiting impressive performance in recent studies. Despite these strides, many existing methods fall short in meeting the efficiency demands of practical applications. In this paper, we introduce WaveDH, a novel and compact ConvNet designed to address this efficiency gap in image dehazing. Our WaveDH leverages wavelet sub-bands for guided up-and-downsampling and frequency-aware feature refinement. The key idea lies in utilizing wavelet decomposition to extract low-and-high frequency components from feature levels, allowing for faster processing while upholding high-quality reconstruction. The downsampling block employs a novel squeeze-and-attention scheme to optimize the feature downsampling process in a structurally compact manner through wavelet domain learning, preserving discriminative features while discarding noise components. In our upsampling block, we introduce a dual-upsample and fusion mechanism to enhance high-frequency component awareness, aiding in the reconstruction of high-frequency details. Departing from conventional dehazing methods that treat low-and-high frequency components equally, our feature refinement block strategically processes features with a frequency-aware approach. By employing a coarse-to-fine methodology, it not only refines the details at frequency levels but also significantly optimizes computational costs. The refinement is performed in a maximum 8x downsampled feature space, striking a favorable efficiency-vs-accuracy trade-off. Extensive experiments demonstrate that our method, WaveDH, outperforms many state-of-the-art methods on several image dehazing benchmarks with significantly reduced computational costs. Our code is available at https://github.com/AwesomeHwang/WaveDH.
Submitted 17 January, 2025; v1 submitted 1 April, 2024;
originally announced April 2024.
-
Real-Time Systems Optimization with Black-box Constraints and Hybrid Variables
Authors:
Sen Wang,
Dong Li,
Shao-Yu Huang,
Xuanliang Deng,
Ashrarul H. Sifat,
Changhee Jung,
Ryan Williams,
Haibo Zeng
Abstract:
When optimizing real-time systems, designers often face a challenging problem where the schedulability constraints are non-convex, non-continuous, or lack an analytical form to understand their properties. Although the optimization framework NORTH proposed in previous work is general (it works with arbitrary schedulability analysis) and scalable, it can only handle problems with continuous variables, which limits its application. In this paper, we extend the applications of the framework NORTH to problems with a hybrid of continuous and discrete variables. This is achieved with a coordinate-descent method, where the continuous and discrete variables are optimized separately during iterations. In experiments, the new framework, NORTH+, improves solution quality by around 20% over NORTH.
Submitted 21 January, 2024;
originally announced January 2024.
-
Joint Optimization of Continuous Variables and Priority Assignments for Real-Time Systems with Black-box Schedulability Constraints
Authors:
Sen Wang,
Dong Li,
Shao-Yu Huang,
Xuanliang Deng,
Ashrarul H. Sifat,
Changhee Jung,
Ryan Williams,
Haibo Zeng
Abstract:
In real-time systems optimization, designers often face a challenging problem posed by the non-convex and non-continuous schedulability conditions, which may even lack an analytical form to understand their properties. To tackle this challenging problem, we treat the schedulability analysis as a black box that only returns true/false results. We propose a general and scalable framework to optimize real-time systems, named Numerical Optimizer with Real-Time Highlight (NORTH). NORTH is built upon the gradient-based active-set methods from the numerical optimization literature but with new methods to manage active constraints for the non-differentiable schedulability constraints. In addition, we also generalize NORTH to NORTH+, to collaboratively optimize certain types of discrete variables (e.g., priority assignments, categorical variables) with continuous variables based on numerical optimization algorithms. We demonstrate the algorithm performance with two example applications: energy minimization based on dynamic voltage and frequency scaling (DVFS), and optimization of control system performance. In these experiments, NORTH achieved $10^2$ to $10^5$ times speed improvements over state-of-the-art methods while maintaining similar or better solution quality. NORTH+ outperforms NORTH by 30% with similar algorithm scalability. Both NORTH and NORTH+ support black-box schedulability analysis, ensuring broad applicability.
Submitted 18 March, 2025; v1 submitted 6 January, 2024;
originally announced January 2024.
-
Optimizing Logical Execution Time Model for Both Determinism and Low Latency
Authors:
Sen Wang,
Dong Li,
Ashrarul H. Sifat,
Shao-Yu Huang,
Xuanliang Deng,
Changhee Jung,
Ryan Williams,
Haibo Zeng
Abstract:
The Logical Execution Time (LET) programming model has recently received considerable attention, particularly because of its timing and dataflow determinism. In LET, task computation appears always to take the same amount of time (called the task's LET interval), and the task reads (resp. writes) at the beginning (resp. end) of the interval. Compared to other communication mechanisms, such as implicit communication and Dynamic Buffer Protocol (DBP), LET performs worse on many metrics, such as end-to-end latency (including reaction time and data age) and time disparity jitter. Compared with the default LET setting, the flexible LET (fLET) model shrinks the LET interval while still guaranteeing schedulability by introducing the virtual offset to defer the read operation and using the virtual deadline to move up the write operation. Therefore, fLET has the potential to significantly improve the end-to-end timing performance while keeping the benefits of deterministic behavior on timing and dataflow.
To fully realize the potential of fLET, we consider the problem of optimizing the assignments of its virtual offsets and deadlines. We propose new abstractions to describe the task communication pattern and new optimization algorithms to explore the solution space efficiently. The algorithms leverage the linearizability of communication patterns and utilize symbolic operations to achieve efficient optimization while providing a theoretical guarantee. The framework supports optimizing multiple performance metrics and guarantees bounded suboptimality when optimizing end-to-end latency. Experimental results show that our optimization algorithms improve upon the default LET and its existing extensions and significantly outperform implicit communication and DBP in terms of various metrics, such as end-to-end latency, time disparity, and its jitter.
Submitted 7 March, 2024; v1 submitted 30 October, 2023;
originally announced October 2023.
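The read/write timing described in the abstract above can be illustrated with a minimal sketch. It assumes periodic tasks whose default LET interval equals the period; that assumption, and the names `Task`, `virtual_offset`, `virtual_deadline`, and `read_write_times`, are hypothetical illustrations rather than the authors' formulation. Schedulability is not checked here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Task:
    period: int                              # release period (time units)
    virtual_offset: int = 0                  # fLET: defers the read operation
    virtual_deadline: Optional[int] = None   # fLET: moves up the write; None = default LET


def read_write_times(task: Task, job_index: int) -> Tuple[int, int]:
    """Return (read_time, write_time) of the job released at job_index * period."""
    release = job_index * task.period
    # Assumption: under default LET the write happens at the end of the period.
    deadline = task.period if task.virtual_deadline is None else task.virtual_deadline
    return release + task.virtual_offset, release + deadline


default_let = Task(period=10)
flexible_let = Task(period=10, virtual_offset=2, virtual_deadline=7)

print(read_write_times(default_let, 2))   # read at 20, write at 30
print(read_write_times(flexible_let, 2))  # read at 22, write at 27
```

In this toy setting the interval between read and write shrinks from 10 to 5 time units, which is the mechanism the abstract credits for lower end-to-end latency.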
-
Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
Authors:
Suyeon Lee,
Chaeyoung Jung,
Youngjoon Jang,
Jaehun Kim,
Joon Son Chung
Abstract:
The objective of this work is to extract a target speaker's voice from a mixture of voices using visual cues. Existing works on audio-visual speech separation have demonstrated their performance with promising intelligibility, but maintaining naturalness remains a challenge. To address this issue, we propose AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism known for its capability in generating natural samples. For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism. This mechanism is specifically tailored for the speech domain to integrate the phonetic information from audio-visual correspondence in speech generation. In this way, the fusion process maintains the high temporal resolution of the features, without excessive computational requirements. We demonstrate that the proposed framework achieves state-of-the-art results on two benchmarks, VoxCeleb2 and LRS3, producing speech with notably better naturalness.
Submitted 30 October, 2023;
originally announced October 2023.
-
TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning
Authors:
Chaeyoung Jung,
Suyeon Lee,
Kihyun Nam,
Kyeongha Rho,
You Jin Kim,
Youngjoon Jang,
Joon Son Chung
Abstract:
The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames. Previous works have dealt with the task by exploring network architectures, while learning effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is applied only to the parts of the full segments where a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into the existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
Submitted 21 September, 2023;
originally announced September 2023.
-
An Autonomous System for Head-to-Head Race: Design, Implementation and Analysis; Team KAIST at the Indy Autonomous Challenge
Authors:
Chanyoung Jung,
Andrea Finazzi,
Hyunki Seong,
Daegyu Lee,
Seungwook Lee,
Bosung Kim,
Gyuri Gang,
Seungil Han,
David Hyunchul Shim
Abstract:
While the majority of autonomous driving research has concentrated on everyday driving scenarios, further safety and performance improvements of autonomous vehicles require a focus on extreme driving conditions. In this context, autonomous racing is a new area of research that has been attracting considerable interest recently. Because a vehicle is driven by its perception, planning, and control limits during racing, numerous research and development issues arise. This paper provides a comprehensive overview of the autonomous racing system built by team KAIST for the Indy Autonomous Challenge (IAC). Our autonomy stack consists primarily of a multi-modal perception module, a high-speed overtaking planner, a resilient control stack, and a system status manager. We present the details of all components of our autonomy solution, including algorithms, implementation, and unit test results. In addition, this paper outlines the design principles and the results of a systematic analysis. Even though our design principles are derived from the unique application domain of autonomous racing, they can also be applied to a variety of safety-critical, high-cost-of-failure robotics applications. The proposed system was integrated into a full-scale autonomous race car (Dallara AV-21) and field-tested extensively. As a result, team KAIST was one of three teams that qualified and participated in the official IAC race events without any accidents. Our proposed autonomous system successfully completed all missions, including overtaking at speeds of around 220 km/h in the IAC@CES2022, the world's first autonomous 1:1 head-to-head race.
Submitted 16 March, 2023;
originally announced March 2023.
-
A Multi-View Learning Approach to Enhance Automatic 12-Lead ECG Diagnosis Performance
Authors:
Jae-Won Choi,
Dae-Yong Hong,
Chan Jung,
Eugene Hwang,
Sung-Hyuk Park,
Seung-Young Roh
Abstract:
The performance of commonly used electrocardiogram (ECG) diagnosis models has recently improved with the introduction of deep learning (DL). However, the impact of various combinations of multiple DL components and the role of data augmentation techniques in the diagnosis have not been sufficiently investigated. This study proposes an ensemble-based multi-view learning approach with an ECG augmentation technique to achieve a higher performance than traditional automatic 12-lead ECG diagnosis methods. The data analysis results show that the proposed model reports an F1 score of 0.840, which outperforms existing state-of-the-art methods in the literature.
Submitted 30 July, 2022;
originally announced August 2022.
-
A Resilient Navigation and Path Planning System for High-speed Autonomous Race Car
Authors:
Daegyu Lee,
Chanyoung Jung,
Andrea Finazzi,
Hyunki Seong,
D. Hyunchul Shim
Abstract:
This paper describes a resilient navigation and planning system used in the Indy Autonomous Challenge (IAC) competition. The IAC is a competition where full-scale race cars run autonomously on Indianapolis Motor Speedway (IMS) at up to 290 km/h (180 mph). Race cars experience severe vibrations, especially at high speeds. These vibrations can degrade standard localization algorithms based on precision GPS-aided inertial measurement units. Degraded localization can lead to serious problems, including collisions. Therefore, we propose a resilient navigation system that enables a race car to stay within the track in the event of localization failures. Our navigation system uses a multi-sensor fusion-based Kalman filter. We detect degradation of the navigation solution using probabilistic approaches to computing optimal measurement values for the correction step of our Kalman filter. In addition, an optimal path planning algorithm for obstacle avoidance is proposed. In this challenge, the track contains static obstacles. The vehicle is required to avoid them with minimal time loss. By taking the original optimal racing line, obstacles, and vehicle dynamics into account, we propose a road-graph-based path planning algorithm to ensure that our race car can perform efficient obstacle avoidance. The proposed localization system was successfully validated to show its capability to prevent localization failures in the event of faulty GPS measurements during the historic world's first autonomous race at IMS. Owing to our robust navigation and planning algorithm, we were able to finish the race as one of the top four teams while the remaining five teams failed to finish due to collisions or out-of-track violations.
Submitted 15 September, 2022; v1 submitted 25 July, 2022;
originally announced July 2022.
-
Attention mechanisms for physiological signal deep learning: which attention should we take?
Authors:
Seong-A Park,
Hyung-Chul Lee,
Chul-Woo Jung,
Hyun-Lim Yang
Abstract:
Attention mechanisms are widely used to dramatically improve deep learning model performance in various fields. However, their general ability to improve the performance of physiological signal deep learning models is not yet well established. In this study, we experimentally analyze four attention mechanisms (squeeze-and-excitation, non-local, convolutional block attention module, and multi-head self-attention) and three convolutional neural network (CNN) architectures (VGG, ResNet, and Inception) for two representative physiological signal prediction tasks: the classification for predicting hypotension and the regression for predicting cardiac output (CO). We evaluated multiple combinations for performance and convergence of physiological signal deep learning models. Accordingly, the CNN models with the spatial attention mechanism showed the best performance in the classification problem, whereas the channel attention mechanism achieved the lowest error in the regression problem. Moreover, the performance and convergence of the CNN models with attention mechanisms were better than those of stand-alone self-attention models in both problems. Hence, we verified that convolutional operation and attention mechanisms are complementary and provide faster convergence time, despite the stand-alone self-attention models requiring fewer parameters.
Submitted 4 July, 2022;
originally announced July 2022.
-
Patch-wise Deep Metric Learning for Unsupervised Low-Dose CT Denoising
Authors:
Chanyong Jung,
Joonhyung Lee,
Sunkyoung You,
Jong Chul Ye
Abstract:
The acquisition conditions for low-dose and high-dose CT images are usually different, so shifts in the CT numbers often occur. Accordingly, unsupervised deep learning-based approaches, which learn the target image distribution, often introduce CT number distortions and result in detrimental effects in diagnostic performance. To address this, here we propose a novel unsupervised learning approach for low-dose CT reconstruction using patch-wise deep metric learning. The key idea is to learn an embedding space by pulling together the positive pairs of image patches that share the same anatomical structure, and pushing apart the negative pairs that have the same noise level. Thereby, the network is trained to suppress the noise level, while retaining the original global CT number distributions even after the image translation. Experimental results confirm that our deep metric learning plays a critical role in producing high quality denoised images without CT number shift.
Submitted 13 July, 2022; v1 submitted 5 July, 2022;
originally announced July 2022.
-
SVBR-NET: A Non-Blind Spatially Varying Defocus Blur Removal Network
Authors:
Ali Karaali,
Claudio Rosito Jung
Abstract:
Defocus blur is a physical consequence of the optical sensors used in most cameras. Although it can be used as a photographic style, it is commonly viewed as an image degradation modeled as the convolution of a sharp image with a spatially-varying blur kernel. Motivated by the advance of blur estimation methods in the past years, we propose a non-blind approach for image deblurring that can deal with spatially-varying kernels. We introduce two encoder-decoder sub-networks that are fed with the blurry image and the estimated blur map, respectively, and produce as output the deblurred (deconvolved) image. Each sub-network presents several skip connections that allow data propagation from layers spread apart, and also inter-subnetwork skip connections that ease the communication between the modules. The network is trained with synthetic blur kernels that are augmented to emulate blur maps produced by existing blur estimation methods, and our experimental results show that our method works well when combined with a variety of blur estimation methods.
△ Less
Submitted 26 June, 2022;
originally announced June 2022.
-
Internal Calibration Process Using Chirp Pulses with Application of the Adam Learning Algorithm
Authors:
Junho Kweon,
Chan-Yong Jung,
Kyung-Bin Bae,
Seong-Ook Park
Abstract:
We propose a new internal calibration process using chirp pulses. Our method mitigates thermal drift, an unwanted change that usually occurs in active elements such as high-power amplifiers and low-noise amplifiers. The proposed method has advantages from two distinct aspects: calibration signal and algorithm. With respect to the calibration signal, our method does not require an additional signal source because chirp pulses, which are normally used for remote sensing, are used as calibration signals. Moreover, our method solves the ambiguity problem of analyzing a phase shift, which occurs when sinusoidal signals are used as calibration signals. With regard to the algorithm, the Adam learning algorithm avoids learning in the wrong direction, unlike the conventional gradient descent.
Using our method, mathematical forms of received signals are acquired successfully. Our method is more effective than the conventional gradient descent algorithm. After compensation, the maximum differences of gain and phase become 0.06 dB and 2.42 degrees, respectively.
Submitted 3 December, 2020;
originally announced December 2020.
-
Edge and Identity Preserving Network for Face Super-Resolution
Authors:
Jonghyun Kim,
Gen Li,
Inyong Yun,
Cheolkon Jung,
Joongkyu Kim
Abstract:
Face super-resolution (SR) has become an indispensable function in security solutions such as video surveillance and identification systems, but the distortion in facial components remains a great challenge. Most state-of-the-art methods have utilized facial priors with deep neural networks. These methods require extra labels, longer training time, and larger computation memory. In this paper, we propose a novel Edge and Identity Preserving Network for face SR, named EIPNet, to minimize the distortion by utilizing a lightweight edge block and identity information. We present an edge block to extract perceptual edge information, and concatenate it to the original feature maps in multiple scales. This structure progressively provides edge information in reconstruction to aggregate local and global structural information. Moreover, we define an identity loss function to preserve identification of SR images. The identity loss function compares feature distributions between SR images and their ground truth to recover identities in SR images. In addition, we provide a luminance-chrominance error (LCE) to separately infer brightness and color information in SR images. The LCE method not only reduces the dependency on color information by dividing brightness and color components but also enables our network to reflect differences between SR images and their ground truth in two color spaces, RGB and YUV. The proposed method enables the proposed SR network to elaborately restore facial components and generate high quality 8x scaled SR images with a lightweight network structure. Furthermore, our network is able to reconstruct a 128x128 SR image at 215 fps on a GTX 1080Ti GPU. Extensive experiments demonstrate that our network qualitatively and quantitatively outperforms state-of-the-art methods on two challenging datasets: CelebA and VGGFace2.
Submitted 30 March, 2021; v1 submitted 27 August, 2020;
originally announced August 2020.
-
W-Net: A CNN-based Architecture for White Blood Cells Image Classification
Authors:
Changhun Jung,
Mohammed Abuhamad,
Jumabek Alikhanov,
Aziz Mohaisen,
Kyungja Han,
DaeHun Nyang
Abstract:
Computer-aided methods for analyzing white blood cells (WBC) have become widely popular due to the complexity of the manual process. Recent works have shown highly accurate segmentation and detection of white blood cells from microscopic blood images. However, the classification of the observed cells remains a challenge and is in high demand, as the distribution of the five types reflects the condition of the immune system. This work proposes W-Net, a CNN-based method for WBC classification. We evaluate W-Net on a real-world large-scale dataset, obtained from The Catholic University of Korea, that includes 6,562 real images of the five WBC types. W-Net achieves an average accuracy of 97%.
Submitted 2 October, 2019;
originally announced October 2019.
-
Optimal Transport driven CycleGAN for Unsupervised Learning in Inverse Problems
Authors:
Byeongsu Sim,
Gyutaek Oh,
Jeongsol Kim,
Chanyong Jung,
Jong Chul Ye
Abstract:
To improve the performance of the classical generative adversarial network (GAN), the Wasserstein generative adversarial network (W-GAN) was developed as a Kantorovich dual formulation of the optimal transport (OT) problem using Wasserstein-1 distance. However, it was not clear how cycleGAN-type generative models can be derived from the optimal transport theory. Here we show that a novel cycleGAN architecture can be derived as a Kantorovich dual OT formulation if a penalized least squares (PLS) cost with deep learning-based inverse path penalty is used as a transportation cost. One of the most important advantages of this formulation is that depending on the knowledge of the forward problem, distinct variations of cycleGAN architecture can be derived: for example, one with two pairs of generators and discriminators, and the other with only a single pair of generator and discriminator. Even for the two-generator case, we show that the structural knowledge of the forward operator can lead to a simpler generator architecture which significantly simplifies the neural network training. The new cycleGAN formulation, which we call the OT-cycleGAN, has been applied to various biomedical imaging problems, such as accelerated magnetic resonance imaging (MRI), super-resolution microscopy, and low-dose x-ray computed tomography (CT). Experimental results confirm the efficacy and flexibility of the theory.
Submitted 30 August, 2020; v1 submitted 25 September, 2019;
originally announced September 2019.
-
Adversarial Defense by Suppressing High-frequency Components
Authors:
Zhendong Zhang,
Cheolkon Jung,
Xiaolong Liang
Abstract:
Recent works show that deep neural networks trained on image classification datasets are biased towards textures. Those models are easily fooled by applying small high-frequency perturbations to clean images. In this paper, we learn robust image classification models by removing high-frequency components. Specifically, we develop a differentiable high-frequency suppression module based on the discrete Fourier transform (DFT). Combining this module with adversarial training, we won 5th place in the IJCAI-2019 Alibaba Adversarial AI Challenge. Our code is available online.
Submitted 3 September, 2019; v1 submitted 18 August, 2019;
originally announced August 2019.
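The idea of removing high-frequency components with a DFT, as summarized in the abstract above, can be sketched in a few lines of NumPy. This is only an illustrative low-pass filter under assumed choices (a square mask and a hypothetical `radius` parameter); it is not the authors' differentiable module, and the function name `suppress_high_frequencies` is made up for this example.

```python
import numpy as np


def suppress_high_frequencies(image: np.ndarray, radius: int) -> np.ndarray:
    """Zero every 2-D DFT coefficient farther than `radius` bins from the zero frequency."""
    # Move the zero-frequency coefficient to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    cy, cx = h // 2, w // 2
    # Square low-pass mask centred on the zero frequency (an assumed shape).
    mask = np.zeros((h, w), dtype=bool)
    mask[max(cy - radius, 0):cy + radius + 1, max(cx - radius, 0):cx + radius + 1] = True
    filtered = np.where(mask, spectrum, 0)
    # Undo the shift and return to the spatial domain; the imaginary part is round-off.
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))


# A checkerboard holds only the mean and the highest spatial frequency,
# so low-pass filtering flattens it to its mean value.
checkerboard = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)
smoothed = suppress_high_frequencies(checkerboard, radius=1)
print(np.allclose(smoothed, 0.5))
```

A perturbation concentrated in the discarded frequency band is removed before the image reaches a classifier, which is the intuition the abstract describes.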
-
Attention-Aware Linear Depthwise Convolution for Single Image Super-Resolution
Authors:
Seongmin Hwang,
Gwanghuyn Yu,
Cheolkon Jung,
Jinyoung Kim
Abstract:
Although deep convolutional neural networks (CNNs) have obtained outstanding performance in image super-resolution (SR), their computational cost increases geometrically as CNN models get deeper and wider. Meanwhile, the features of intermediate layers are treated equally across the channel, thus hindering the representational capability of CNNs. In this paper, we propose an attention-aware linear depthwise network, named ALDNet, to address these problems for single image SR. Specifically, linear depthwise convolution allows CNN-based SR models to preserve useful information for reconstructing a super-resolved image while reducing computational burden. Furthermore, we design an attention-aware branch that enhances the representation ability of depthwise convolution layers by making full use of depthwise filter interdependency. Experiments on publicly available benchmark datasets show that ALDNet achieves superior performance to traditional depthwise separable convolutions in terms of quantitative measurements and visual quality.
Submitted 29 November, 2019; v1 submitted 7 August, 2019;
originally announced August 2019.