-
Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)
Authors:
Yang Zhou,
Chrystie Wan Ning Quek,
Jun Zhou,
Yan Wang,
Yang Bai,
Yuhe Ke,
Jie Yao,
Laura Gutierrez,
Zhen Ling Teo,
Darren Shu Jeng Ting,
Brian T. Soetikno,
Christopher S. Nielsen,
Tobias Elze,
Zengxiang Li,
Linh Le Dinh,
Lionel Tim-Ee Cheng,
Tran Nguyen Tuan Anh,
Chee Leong Cheng,
Tien Yin Wong,
Nan Liu,
Iain Beehuat Tan,
Tony Kiat Hon Lim,
Rick Siow Mong Goh,
Yong Liu,
Daniel Shu Wei Ting
Abstract:
Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundatio…
▽ More
Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.
△ Less
Submitted 30 June, 2025;
originally announced July 2025.
-
MedRegion-CT: Region-Focused Multimodal LLM for Comprehensive 3D CT Report Generation
Authors:
Sunggu Kyung,
Jinyoung Seo,
Hyunseok Lim,
Dongyeong Kim,
Hyungbin Park,
Jimin Sung,
Jihyun Kim,
Wooyoung Jo,
Yoojin Nam,
Namkug Kim
Abstract:
The recent release of RadGenome-Chest CT has significantly advanced CT-based report generation. However, existing methods primarily focus on global features, making it challenging to capture region-specific details, which may cause certain abnormalities to go unnoticed. To address this, we propose MedRegion-CT, a region-focused Multi-Modal Large Language Model (MLLM) framework, featuring three key…
▽ More
The recent release of RadGenome-Chest CT has significantly advanced CT-based report generation. However, existing methods primarily focus on global features, making it challenging to capture region-specific details, which may cause certain abnormalities to go unnoticed. To address this, we propose MedRegion-CT, a region-focused Multi-Modal Large Language Model (MLLM) framework, featuring three key innovations. First, we introduce Region Representative ($R^2$) Token Pooling, which utilizes a 2D-wise pretrained vision model to efficiently extract 3D CT features. This approach generates global tokens representing overall slice features and region tokens highlighting target areas, enabling the MLLM to process comprehensive information effectively. Second, a universal segmentation model generates pseudo-masks, which are then processed by a mask encoder to extract region-centric features. This allows the MLLM to focus on clinically relevant regions, using six predefined region masks. Third, we leverage segmentation results to extract patient-specific attributions, including organ size, diameter, and locations. These are converted into text prompts, enriching the MLLM's understanding of patient-specific contexts. To ensure rigorous evaluation, we conducted benchmark experiments on report generation using the RadGenome-Chest CT. MedRegion-CT achieved state-of-the-art performance, outperforming existing methods in natural language generation quality and clinical relevance while maintaining interpretability. The code for our framework is publicly available.
△ Less
Submitted 29 June, 2025;
originally announced June 2025.
-
Neural Spectral Band Generation for Audio Coding
Authors:
Woongjib Choi,
Byeong Hyeon Kim,
Hyungseob Lim,
Inseon Jang,
Hong-Goo Kang
Abstract:
Spectral band replication (SBR) enables bit-efficient coding by generating high-frequency bands from the low-frequency ones. However, it only utilizes coarse spectral features upon a subband-wise signal replication, limiting adaptability to diverse acoustic signals. In this paper, we explore the efficacy of a deep neural network (DNN)-based generative approach for coding the high-frequency bands,…
▽ More
Spectral band replication (SBR) enables bit-efficient coding by generating high-frequency bands from the low-frequency ones. However, it only utilizes coarse spectral features upon a subband-wise signal replication, limiting adaptability to diverse acoustic signals. In this paper, we explore the efficacy of a deep neural network (DNN)-based generative approach for coding the high-frequency bands, which we call neural spectral band generation (n-SBG). Specifically, we propose a DNN-based encoder-decoder structure to extract and quantize the side information related to the high-frequency components and generate the components given both the side information and the decoded core-band signals. The whole coding pipeline is optimized with generative adversarial criteria to enable the generation of perceptually plausible sound. From experiments using AAC as the core codec, we show that the proposed method achieves a better perceptual quality than HE-AAC-v1 with much less side information.
△ Less
Submitted 28 July, 2025; v1 submitted 7 June, 2025;
originally announced June 2025.
-
HyperspectralMAE: The Hyperspectral Imagery Classification Model using Fourier-Encoded Dual-Branch Masked Autoencoder
Authors:
Wooyoung Jeong,
Hyun Jae Park,
Seonghun Jeong,
Jong Wook Jang,
Tae Hoon Lim,
Dae Seoung Kim
Abstract:
Hyperspectral imagery provides rich spectral detail but poses unique challenges because of its high dimensionality in both spatial and spectral domains. We propose \textit{HyperspectralMAE}, a Transformer-based foundation model for hyperspectral data that employs a \textit{dual masking} strategy: during pre-training we randomly occlude 50\% of spatial patches and 50\% of spectral bands. This force…
▽ More
Hyperspectral imagery provides rich spectral detail but poses unique challenges because of its high dimensionality in both spatial and spectral domains. We propose \textit{HyperspectralMAE}, a Transformer-based foundation model for hyperspectral data that employs a \textit{dual masking} strategy: during pre-training we randomly occlude 50\% of spatial patches and 50\% of spectral bands. This forces the model to learn representations capable of reconstructing missing information across both dimensions. To encode spectral order, we introduce learnable harmonic Fourier positional embeddings based on wavelength. The reconstruction objective combines mean-squared error (MSE) with the spectral angle mapper (SAM) to balance pixel-level accuracy and spectral-shape fidelity.
The resulting model contains about $1.8\times10^{8}$ parameters and produces 768-dimensional embeddings, giving it sufficient capacity for transfer learning. We pre-trained HyperspectralMAE on two large hyperspectral corpora -- NASA EO-1 Hyperion ($\sim$1\,600 scenes, $\sim$$3\times10^{11}$ pixel spectra) and DLR EnMAP Level-0 ($\sim$1\,300 scenes, $\sim$$3\times10^{11}$ pixel spectra) -- and fine-tuned it for land-cover classification on the Indian Pines benchmark. HyperspectralMAE achieves state-of-the-art transfer-learning accuracy on Indian Pines, confirming that masked dual-dimensional pre-training yields robust spectral-spatial representations. These results demonstrate that dual masking and wavelength-aware embeddings advance hyperspectral image reconstruction and downstream analysis.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
PD-Skygroundhook Controller for Semi-Active Suspension System Using Magnetorheological Fluid Dampers
Authors:
Hansol Lim,
Jee Won Lee,
Seung-Bok Choi,
Jongseong Brad Choi
Abstract:
This paper presents a Proportional-Derivative (PD) Skygroundhook controller for magnetorheological (MR) dampers in semi-active suspensions. Traditional skyhook, Groundhook, and hybrid Skygroundhook controllers are well-known for their ability to reduce body and wheel vibrations; however, each approach has limitations in handling a broad frequency spectrum and often relies on abrupt switching. By a…
▽ More
This paper presents a Proportional-Derivative (PD) Skygroundhook controller for magnetorheological (MR) dampers in semi-active suspensions. Traditional skyhook, Groundhook, and hybrid Skygroundhook controllers are well-known for their ability to reduce body and wheel vibrations; however, each approach has limitations in handling a broad frequency spectrum and often relies on abrupt switching. By adding a derivative action to the classical Skygroundhook logic, the proposed PD-Skygroundhook method enhances high-frequency damping and stabilizes transition behaviors. By leveraging the fast response of MR dampers, our controller adjusts the damper force continuously in real time to match the desired damping force of PD-Skygroundhook controller with efficient computation. Experimental evaluations under bump excitations and sine-sweeping tests demonstrate a significant reduction in sprung mass acceleration and unsprung mass acceleration, outperforming standard Skygroundhook in both ride comfort and road handling. These results highlight that the derivative action effectively reduces resonance peaks and smooths out force transitions of regular Skygroundhook. Our method offers a robust alternative to more computationally demanding semi-active controllers.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
Authors:
Minkyun Seo,
Hyungtae Lim,
Kanghee Lee,
Luca Carlone,
Jaesik Park
Abstract:
Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, an…
▽ More
Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.
△ Less
Submitted 6 August, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
Integrated Communication and Binary State Detection Under Unequal Error Constraints
Authors:
Daewon Seo,
Sung Hoon Lim
Abstract:
This work considers a problem of integrated sensing and communication (ISAC) in which the goal of sensing is to detect a binary state. Unlike most approaches that minimize the total detection error probability, in our work, we disaggregate the error probability into false alarm and missed detection probabilities and investigate their information-theoretic three-way tradeoff including communication…
▽ More
This work considers a problem of integrated sensing and communication (ISAC) in which the goal of sensing is to detect a binary state. Unlike most approaches that minimize the total detection error probability, in our work, we disaggregate the error probability into false alarm and missed detection probabilities and investigate their information-theoretic three-way tradeoff including communication data rate. We consider a broadcast channel that consists of a transmitter, a communication receiver, and a detector where the receiver's and the detector's channels are affected by an unknown binary state. We consider and present results on two different state-dependent models. In the first setting, the state is fixed throughout the entire transmission, for which we fully characterize the optimal three-way tradeoff between the coding rate for communication and the two possibly nonidentical error exponents for sensing in the asymptotic regime. The achievability and converse proofs rely on the analysis of the cumulant-generating function of the log-likelihood ratio. In the second setting, the state changes every symbol in an independently and identically distributed (i.i.d.) manner, for which we characterize the optimal tradeoff region based on the analysis of the receiver operating characteristic (ROC) curves.
△ Less
Submitted 31 January, 2025;
originally announced January 2025.
-
Multi-modal AI for comprehensive breast cancer prognostication
Authors:
Jan Witowski,
Ken G. Zeng,
Joseph Cappadona,
Jailan Elayoubi,
Khalil Choucair,
Elena Diana Chiru,
Nancy Chan,
Young-Joon Kang,
Frederick Howard,
Irina Ostrovnaya,
Carlos Fernandez-Granda,
Freya Schnabel,
Zoe Steinsnyder,
Ugur Ozerdem,
Kangning Liu,
Waleed Abdulsattar,
Yu Zong,
Lina Daoud,
Rafic Beydoun,
Anas Saad,
Nitya Thakore,
Mohammad Sadic,
Frank Yeung,
Elisa Liu,
Theodore Hill
, et al. (26 additional authors not shown)
Abstract:
Treatment selection in breast cancer is guided by molecular subtypes and clinical characteristics. However, current tools including genomic assays lack the accuracy required for optimal clinical decision-making. We developed a novel artificial intelligence (AI)-based approach that integrates digital pathology images with clinical data, providing a more robust and effective method for predicting th…
▽ More
Treatment selection in breast cancer is guided by molecular subtypes and clinical characteristics. However, current tools including genomic assays lack the accuracy required for optimal clinical decision-making. We developed a novel artificial intelligence (AI)-based approach that integrates digital pathology images with clinical data, providing a more robust and effective method for predicting the risk of cancer recurrence in breast cancer patients. Specifically, we utilized a vision transformer pan-cancer foundation model trained with self-supervised learning to extract features from digitized H&E-stained slides. These features were integrated with clinical data to form a multi-modal AI test predicting cancer recurrence and death. The test was developed and evaluated using data from a total of 8,161 female breast cancer patients across 15 cohorts originating from seven countries. Of these, 3,502 patients from five cohorts were used exclusively for evaluation, while the remaining patients were used for training. Our test accurately predicted our primary endpoint, disease-free interval, in the five evaluation cohorts (C-index: 0.71 [0.68-0.75], HR: 3.63 [3.02-4.37, p<0.001]). In a direct comparison (n=858), the AI test was more accurate than Oncotype DX, the standard-of-care 21-gene assay, achieving a C-index of 0.67 [0.61-0.74] versus 0.61 [0.49-0.73], respectively. Additionally, the AI test added independent prognostic information to Oncotype DX in a multivariate analysis (HR: 3.11 [1.91-5.09, p<0.001)]). The test demonstrated robust accuracy across major molecular breast cancer subtypes, including TNBC (C-index: 0.71 [0.62-0.81], HR: 3.81 [2.35-6.17, p=0.02]), where no diagnostic tools are currently recommended by clinical guidelines. These results suggest that our AI test improves upon the accuracy of existing prognostic tests, while being applicable to a wider range of patients.
△ Less
Submitted 2 March, 2025; v1 submitted 28 October, 2024;
originally announced October 2024.
-
Brain-Aware Readout Layers in GNNs: Advancing Alzheimer's early Detection and Neuroimaging
Authors:
Jiwon Youn,
Dong Woo Kang,
Hyun Kook Lim,
Mansu Kim
Abstract:
Alzheimer's disease (AD) is a neurodegenerative disorder characterized by progressive memory and cognitive decline, affecting millions worldwide. Diagnosing AD is challenging due to its heterogeneous nature and variable progression. This study introduces a novel brain-aware readout layer (BA readout layer) for Graph Neural Networks (GNNs), designed to improve interpretability and predictive accura…
▽ More
Alzheimer's disease (AD) is a neurodegenerative disorder characterized by progressive memory and cognitive decline, affecting millions worldwide. Diagnosing AD is challenging due to its heterogeneous nature and variable progression. This study introduces a novel brain-aware readout layer (BA readout layer) for Graph Neural Networks (GNNs), designed to improve interpretability and predictive accuracy in neuroimaging for early AD diagnosis. By clustering brain regions based on functional connectivity and node embedding, this layer improves the GNN's capability to capture complex brain network characteristics. We analyzed neuroimaging data from 383 participants, including both cognitively normal and preclinical AD individuals, using T1-weighted MRI, resting-state fMRI, and FBB-PET to construct brain graphs. Our results show that GNNs with the BA readout layer significantly outperform traditional models in predicting the Preclinical Alzheimer's Cognitive Composite (PACC) score, demonstrating higher robustness and stability. The adaptive BA readout layer also offers enhanced interpretability by highlighting task-specific brain regions critical to cognitive functions impacted by AD. These findings suggest that our approach provides a valuable tool for the early diagnosis and analysis of Alzheimer's disease.
△ Less
Submitted 3 October, 2024;
originally announced October 2024.
-
Obstacle-Aware Quadrupedal Locomotion With Resilient Multi-Modal Reinforcement Learning
Authors:
I Made Aswin Nahrendra,
Byeongho Yu,
Minho Oh,
Dongkyu Lee,
Seunghyun Lee,
Hyeonwoo Lee,
Hyungtae Lim,
Hyun Myung
Abstract:
Quadrupedal robots hold promising potential for applications in navigating cluttered environments with resilience akin to their animal counterparts. However, their floating base configuration makes them vulnerable to real-world uncertainties, yielding substantial challenges in their locomotion control. Deep reinforcement learning has become one of the plausible alternatives for realizing a robust…
▽ More
Quadrupedal robots hold promising potential for applications in navigating cluttered environments with resilience akin to their animal counterparts. However, their floating base configuration makes them vulnerable to real-world uncertainties, yielding substantial challenges in their locomotion control. Deep reinforcement learning has become one of the plausible alternatives for realizing a robust locomotion controller. However, the approaches that rely solely on proprioception sacrifice collision-free locomotion because they require front-feet contact to detect the presence of stairs to adapt the locomotion gait. Meanwhile, incorporating exteroception necessitates a precisely modeled map observed by exteroceptive sensors over a period of time. Therefore, this work proposes a novel method to fuse proprioception and exteroception featuring a resilient multi-modal reinforcement learning. The proposed method yields a controller that showcases agile locomotion performance on a quadrupedal robot over a myriad of real-world courses, including rough terrains, steep slopes, and high-rise stairs, while retaining its robustness against out-of-distribution situations.
△ Less
Submitted 29 September, 2024;
originally announced September 2024.
-
LiDAR-3DGS: LiDAR Reinforced 3D Gaussian Splatting for Multimodal Radiance Field Rendering
Authors:
Hansol Lim,
Hanbeom Chang,
Jongseong Brad Choi,
Chul Min Yeum
Abstract:
In this paper, we explore the capabilities of multimodal inputs to 3D Gaussian Splatting (3DGS) based Radiance Field Rendering. We present LiDAR-3DGS, a novel method of reinforcing 3DGS inputs with LiDAR generated point clouds to significantly improve the accuracy and detail of 3D models. We demonstrate a systematic approach of LiDAR reinforcement to 3DGS to enable capturing of important features…
▽ More
In this paper, we explore the capabilities of multimodal inputs to 3D Gaussian Splatting (3DGS) based Radiance Field Rendering. We present LiDAR-3DGS, a novel method of reinforcing 3DGS inputs with LiDAR generated point clouds to significantly improve the accuracy and detail of 3D models. We demonstrate a systematic approach of LiDAR reinforcement to 3DGS to enable capturing of important features such as bolts, apertures, and other details that are often missed by image-based features alone. These details are crucial for engineering applications such as remote monitoring and maintenance. Without modifying the underlying 3DGS algorithm, we demonstrate that even a modest addition of LiDAR generated point cloud significantly enhances the perceptual quality of the models. At 30k iterations, the model generated by our method resulted in an increase of 7.064% in PSNR and 0.565% in SSIM, respectively. Since the LiDAR used in this research was a commonly used commercial-grade device, the improvements observed were modest and can be further enhanced with higher-grade LiDAR systems. Additionally, these improvements can be supplementary to other derivative works of Radiance Field Rendering and also provide a new insight for future LiDAR and computer vision integrated modeling.
△ Less
Submitted 9 September, 2024;
originally announced September 2024.
-
Identity-enabled CDMA LiDAR for massively parallel ranging with a single-element receiver
Authors:
Yixiu Shen,
Zi Heng Lim,
Guangya Zhou
Abstract:
Light detection and ranging (LiDAR) have emerged as a crucial tool for high-resolution 3D imaging, particularly in autonomous vehicles, remote sensing, and augmented reality. However, the increasing demand for faster acquisition speed and higher resolution in LiDAR systems has highlighted the limitations of traditional mechanical scanning methods. This study introduces a novel wavelength-multiplex…
▽ More
Light detection and ranging (LiDAR) have emerged as a crucial tool for high-resolution 3D imaging, particularly in autonomous vehicles, remote sensing, and augmented reality. However, the increasing demand for faster acquisition speed and higher resolution in LiDAR systems has highlighted the limitations of traditional mechanical scanning methods. This study introduces a novel wavelength-multiplexed code-division multiple access (CDMA) parallel laser ranging approach with a single-pixel receiver to address these challenges. By leveraging the unique properties of Gold-sequences in a direct-sequence spread spectrum (DSSS) framework, our design enables comprehensive parallelization in detection and ranging activities to significantly enhance system efficiency and user capacity. The proposed coaxial architecture simplifies hardware requirements using a single avalanche photodiode (APD) for multi-reception, reducing susceptibility to ambient noise and external interferences. We demonstrate 3D imaging at 5 m and 10 m, and the experimental results highlight the capability of our CDMA LiDAR system to achieve 40 parallel ranging channels with centimeter-level depth resolution and an angular resolution of 0.03 degree. Furthermore, our system allows for user identification modulation, enabling identity-based ranging among different users. The robustness of our proposed system against interference and speckle noise and near-far signal problems, combined with its potential for miniaturization and integration into chip-scale optics, presents a promising avenue to develop high-performance, compact LiDAR systems suitable for commercial applications.
△ Less
Submitted 10 July, 2024; v1 submitted 9 July, 2024;
originally announced July 2024.
-
TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation
Authors:
Taeyang Yun,
Hyunkuk Lim,
Jeonghwan Lee,
Min Song
Abstract:
Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challen…
▽ More
Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.
△ Less
Submitted 31 March, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
On the Fundamental Tradeoff of Joint Communication and Quickest Change Detection with State-Independent Data Channels
Authors:
Daewon Seo,
Sung Hoon Lim
Abstract:
In this work, we take the initiative in studying the information-theoretic tradeoff between communication and quickest change detection (QCD) under an integrated sensing and communication setting. We formally establish a joint communication and sensing problem for the quickest change detection. We assume a broadcast channel with a transmitter, a communication receiver, and a QCD detector in which…
▽ More
In this work, we take the initiative in studying the information-theoretic tradeoff between communication and quickest change detection (QCD) under an integrated sensing and communication setting. We formally establish a joint communication and sensing problem for the quickest change detection. We assume a broadcast channel with a transmitter, a communication receiver, and a QCD detector in which only the detection channel is state dependent. For the problem setting, by utilizing constant subblock-composition codes and a modified CuSum detection rule, which we call subblock CuSum (SCS), we provide an inner bound on the information-theoretic tradeoff between communication rate and change point detection delay in the asymptotic regime of vanishing false alarm rate. We further provide a partial converse that matches our inner bound for a certain class of codes. This implies that the SCS detection strategy is asymptotically optimal for our codes as the false alarm rate constraint vanishes. We also present some canonical examples of the tradeoff region for a binary channel, a scalar Gaussian channel, and a MIMO Gaussian channel.
△ Less
Submitted 9 October, 2024; v1 submitted 23 January, 2024;
originally announced January 2024.
-
Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data
Authors:
Hyungseob Lim,
Kyungguen Byun,
Sunkuk Moon,
Erik Visser
Abstract:
While many recent any-to-any voice conversion models succeed in transferring some target speech's style information to the converted speech, they still lack the ability to faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content without requ…
▽ More
While many recent any-to-any voice conversion models succeed in transferring some target speech's style information to the converted speech, they still lack the ability to faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content without requiring text transcriptions or speaker labeling. Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model to collect the speaking styles of a target speaker each corresponding to the different phonetic content. The styles are represented with a set of embeddings called stylebook. In the next step, the stylebook is attended with the source speech's phonetic content to determine the final target style for each source content. Finally, content information extracted from the source speech and content-dependent target style embeddings are fed into a diffusion-based decoder to generate the converted speech mel-spectrogram. Experiment results show that our proposed method combined with a diffusion-based generative model can achieve better speaker similarity in any-to-any voice conversion tasks when compared to baseline models, while the increase in computational complexity with longer utterances is suppressed.
△ Less
Submitted 14 December, 2023; v1 submitted 6 September, 2023;
originally announced September 2023.
-
Continuous-Time Distributed Dynamic Programming for Networked Multi-Agent Markov Decision Processes
Authors:
Donghwan Lee,
Han-Dong Lim,
Do Wan Kim
Abstract:
The main goal of this paper is to investigate continuous-time distributed dynamic programming (DP) algorithms for networked multi-agent Markov decision problems (MAMDPs). In our study, we adopt a distributed multi-agent framework where individual agents have access only to their own rewards, lacking insights into the rewards of other agents. Moreover, each agent has the ability to share its parame…
▽ More
The main goal of this paper is to investigate continuous-time distributed dynamic programming (DP) algorithms for networked multi-agent Markov decision problems (MAMDPs). In our study, we adopt a distributed multi-agent framework where individual agents have access only to their own rewards, lacking insights into the rewards of other agents. Moreover, each agent has the ability to share its parameters with neighboring agents through a communication network, represented by a graph. We first introduce a novel distributed DP, inspired by the distributed optimization method of Wang and Elia. Next, a new distributed DP is introduced through a decoupling process. The convergence of the DP algorithms is proved through systems and control perspectives. The study in this paper sets the stage for new distributed temporal different learning algorithms.
△ Less
Submitted 13 June, 2024; v1 submitted 31 July, 2023;
originally announced July 2023.
-
Multi-Agent Reachability Calibration with Conformal Prediction
Authors:
Anish Muthali,
Haotian Shen,
Sampada Deglurkar,
Michael H. Lim,
Rebecca Roelofs,
Aleksandra Faust,
Claire Tomlin
Abstract:
We investigate methods to provide safety assurances for autonomous agents that incorporate predictions of other, uncontrolled agents' behavior into their own trajectory planning. Given a learning-based forecasting model that predicts agents' trajectories, we introduce a method for providing probabilistic assurances on the model's prediction error with calibrated confidence intervals. Through quant…
▽ More
We investigate methods to provide safety assurances for autonomous agents that incorporate predictions of other, uncontrolled agents' behavior into their own trajectory planning. Given a learning-based forecasting model that predicts agents' trajectories, we introduce a method for providing probabilistic assurances on the model's prediction error with calibrated confidence intervals. Through quantile regression, conformal prediction, and reachability analysis, our method generates probabilistically safe and dynamically feasible prediction sets. We showcase their utility in certifying the safety of planning algorithms, both in simulations using actual autonomous driving data and in an experiment with Boeing vehicles.
△ Less
Submitted 13 December, 2023; v1 submitted 1 April, 2023;
originally announced April 2023.
-
Lightweight feature encoder for wake-up word detection based on self-supervised speech representation
Authors:
Hyungjun Lim,
Younggwan Kim,
Kiho Yeom,
Eunjoo Seo,
Hoodong Lee,
Stanley Jungkyu Choi,
Honglak Lee
Abstract:
Self-supervised learning method that provides generalized speech representations has recently received increasing attention. Wav2vec 2.0 is the most famous example, showing remarkable performance in numerous downstream speech processing tasks. Despite its success, it is challenging to use it directly for wake-up word detection on mobile devices due to its expensive computational cost. In this work…
▽ More
Self-supervised learning method that provides generalized speech representations has recently received increasing attention. Wav2vec 2.0 is the most famous example, showing remarkable performance in numerous downstream speech processing tasks. Despite its success, it is challenging to use it directly for wake-up word detection on mobile devices due to its expensive computational cost. In this work, we propose LiteFEW, a lightweight feature encoder for wake-up word detection that preserves the inherent ability of wav2vec 2.0 with a minimum scale. In the method, the knowledge of the pre-trained wav2vec 2.0 is compressed by introducing an auto-encoder-based dimensionality reduction technique and distilled to LiteFEW. Experimental results on the open-source "Hey Snips" dataset show that the proposed method applied to various model structures significantly improves the performance, achieving over 20% of relative improvements with only 64k parameters.
△ Less
Submitted 13 March, 2023;
originally announced March 2023.
-
RollBack: A New Time-Agnostic Replay Attack Against the Automotive Remote Keyless Entry Systems
Authors:
Levente Csikor,
Hoon Wei Lim,
Jun Wen Wong,
Soundarya Ramesh,
Rohini Poolat Parameswarath,
Mun Choon Chan
Abstract:
Today's RKE systems implement disposable rolling codes, making every key fob button press unique, effectively preventing simple replay attacks. However, a prior attack called RollJam was proven to break all rolling code-based systems in general. By a careful sequence of signal jamming, capturing, and replaying, an attacker can become aware of the subsequent valid unlock signal that has not been us…
▽ More
Today's RKE systems implement disposable rolling codes, making every key fob button press unique, effectively preventing simple replay attacks. However, a prior attack called RollJam was proven to break all rolling code-based systems in general. By a careful sequence of signal jamming, capturing, and replaying, an attacker can become aware of the subsequent valid unlock signal that has not been used yet. RollJam, however, requires continuous deployment indefinitely until it is exploited. Otherwise, the captured signals become invalid if the key fob is used again without RollJam in place. We introduce RollBack, a new replay-and-resynchronize attack against most of today's RKE systems. In particular, we show that even though the one-time code becomes invalid in rolling code systems, replaying a few previously captured signals consecutively can trigger a rollback-like mechanism in the RKE system. Put differently, the rolling codes become resynchronized back to a previous code used in the past from where all subsequent yet already used signals work again. Moreover, the victim can still use the key fob without noticing any difference before and after the attack. Unlike RollJam, RollBack does not necessitate jamming at all. Furthermore, it requires signal capturing only once and can be exploited at any time in the future as many times as desired. This time-agnostic property is particularly attractive to attackers, especially in car-sharing/renting scenarios where accessing the key fob is straightforward. However, while RollJam defeats virtually any rolling code-based system, vehicles might have additional anti-theft measures against malfunctioning key fobs, hence against RollBack. Our ongoing analysis (covering Asian vehicle manufacturers for the time being) against different vehicle makes and models has revealed that ~70% of them are vulnerable to RollBack.
△ Less
Submitted 14 September, 2022;
originally announced October 2022.
-
Optimality Guarantees for Particle Belief Approximation of POMDPs
Authors:
Michael H. Lim,
Tyler J. Becker,
Mykel J. Kochenderfer,
Claire J. Tomlin,
Zachary N. Sunberg
Abstract:
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems. However, POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid, which is often the case for physical systems. While recent online sampling-based POMDP algorithms that plan with observation likelihood w…
▽ More
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems. However, POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid, which is often the case for physical systems. While recent online sampling-based POMDP algorithms that plan with observation likelihood weighting have shown practical effectiveness, a general theory characterizing the approximation error of the particle filtering techniques that these algorithms use has not previously been proposed. Our main contribution is bounding the error between any POMDP and its corresponding finite sample particle belief MDP (PB-MDP) approximation. This fundamental bridge between PB-MDPs and POMDPs allows us to adapt any sampling-based MDP algorithm to a POMDP by solving the corresponding particle belief MDP, thereby extending the convergence guarantees of the MDP algorithm to the POMDP. Practically, this is implemented by using the particle filter belief transition model as the generative model for the MDP solver. While this requires access to the observation density model from the POMDP, it only increases the transition sampling complexity of the MDP solver by a factor of $\mathcal{O}(C)$, where $C$ is the number of particles. Thus, when combined with sparse sampling MDP algorithms, this approach can yield algorithms for POMDPs that have no direct theoretical dependence on the size of the state and observation spaces. In addition to our theoretical contribution, we perform five numerical experiments on benchmark POMDPs to demonstrate that a simple MDP algorithm adapted using PB-MDP approximation, Sparse-PFT, achieves performance competitive with other leading continuous observation POMDP solvers.
△ Less
Submitted 19 October, 2023; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Navigation between initial and desired community states using shortcuts
Authors:
Benjamin W. Blonder,
Michael H. Lim,
Zachary Sunberg,
Claire Tomlin
Abstract:
Ecological management problems often involve navigating from an initial to a desired community state. We ask whether navigation is possible without brute-force additions and deletions of species, using actions of varying costs: adding/deleting a small number of individuals of a species, changing the environment, and waiting. Navigation can yield direct paths (single sequence of actions) or shortcu…
▽ More
Ecological management problems often involve navigating from an initial to a desired community state. We ask whether navigation is possible without brute-force additions and deletions of species, using actions of varying costs: adding/deleting a small number of individuals of a species, changing the environment, and waiting. Navigation can yield direct paths (single sequence of actions) or shortcut paths (multiple sequences of actions with lower cost than a direct path). We ask (1) when is non-brute-force navigation possible?; (2) do shortcuts exist and what are their properties?; and (3) what heuristics predict shortcut existence? Using several empirical datasets, we show that (1) non-brute-force navigation is only possible between some state pairs, (2) shortcuts exist between many state pairs; and (3) changes in abundance and richness are the strongest predictors of shortcut existence, independent of dataset and algorithm choices. State diagrams thus unveil hidden strategies for efficiently shifting between states.
△ Less
Submitted 2 December, 2022; v1 submitted 15 April, 2022;
originally announced April 2022.
-
Residual Aligner Network
Authors:
Jian-Qing Zheng,
Ziyang Wang,
Baoru Huang,
Ngee Han Lim,
Bartlomiej W. Papiez
Abstract:
Image registration is important for medical imaging, the estimation of the spatial transformation between different images. Many previous studies have used learning-based methods for coarse-to-fine registration to efficiently perform 3D image registration. The coarse-to-fine approach, however, is limited when dealing with the different motions of nearby objects. Here we propose a novel Motion-Awar…
▽ More
Image registration is important for medical imaging, the estimation of the spatial transformation between different images. Many previous studies have used learning-based methods for coarse-to-fine registration to efficiently perform 3D image registration. The coarse-to-fine approach, however, is limited when dealing with the different motions of nearby objects. Here we propose a novel Motion-Aware (MA) structure that captures the different motions in a region. The MA structure incorporates a novel Residual Aligner (RA) module which predicts the multi-head displacement field used to disentangle the different motions of multiple neighbouring objects. Compared with other deep learning methods, the network based on the MA structure and RA module achieve one of the most accurate unsupervised inter-subject registration on the 9 organs of assorted sizes in abdominal CT scans, with the highest-ranked registration of the veins (Dice Similarity Coefficient / Average surface distance: 62\%/4.9mm for the vena cava and 34\%/7.9mm for the portal and splenic vein), with a half-sized structure and more efficient computation. Applied to the segmentation of lungs in chest CT scans, the new network achieves results which were indistinguishable from the best-ranked networks (94\%/3.0mm). Additionally, the theorem on predicted motion pattern and the design of MA structure are validated by further analysis.
△ Less
Submitted 7 March, 2022;
originally announced March 2022.
-
Compositional Learning-based Planning for Vision POMDPs
Authors:
Sampada Deglurkar,
Michael H. Lim,
Johnathan Tucker,
Zachary N. Sunberg,
Aleksandra Faust,
Claire J. Tomlin
Abstract:
The Partially Observable Markov Decision Process (POMDP) is a powerful framework for capturing decision-making problems that involve state and transition uncertainty. However, most current POMDP planners cannot effectively handle high-dimensional image observations prevalent in real world applications, and often require lengthy online training that requires interaction with the environment. In thi…
▽ More
The Partially Observable Markov Decision Process (POMDP) is a powerful framework for capturing decision-making problems that involve state and transition uncertainty. However, most current POMDP planners cannot effectively handle high-dimensional image observations prevalent in real world applications, and often require lengthy online training that requires interaction with the environment. In this work, we propose Visual Tree Search (VTS), a compositional learning and planning procedure that combines generative models learned offline with online model-based POMDP planning. The deep generative observation models evaluate the likelihood of and predict future image observations in a Monte Carlo tree search planner. We show that VTS is robust to different types of image noises that were not present during training and can adapt to different reward structures without the need to re-train. This new approach significantly and stably outperforms several baseline state-of-the-art vision POMDP algorithms while using a fraction of the training time.
△ Less
Submitted 2 December, 2022; v1 submitted 17 December, 2021;
originally announced December 2021.
-
Adversarial Scene Reconstruction and Object Detection System for Assisting Autonomous Vehicle
Authors:
Md Foysal Haque,
Hay-Youn Lim,
Dae-Seong Kang
Abstract:
In the current computer vision era classifying scenes through video surveillance systems is a crucial task. Artificial Intelligence (AI) Video Surveillance technologies have been advanced remarkably while artificial intelligence and deep learning ascended into the system. Adopting the superior compounds of deep learning visual classification methods achieved enormous accuracy in classifying visual…
▽ More
In the current computer vision era classifying scenes through video surveillance systems is a crucial task. Artificial Intelligence (AI) Video Surveillance technologies have been advanced remarkably while artificial intelligence and deep learning ascended into the system. Adopting the superior compounds of deep learning visual classification methods achieved enormous accuracy in classifying visual scenes. However, the visual classifiers face difficulties examining the scenes in dark visible areas, especially during the nighttime. Also, the classifiers face difficulties in identifying the contexts of the scenes. This paper proposed a deep learning model that reconstructs dark visual scenes to clear scenes like daylight, and the method recognizes visual actions for the autonomous vehicle. The proposed model achieved 87.3 percent accuracy for scene reconstruction and 89.2 percent in scene understanding and detection tasks.
△ Less
Submitted 13 October, 2021;
originally announced October 2021.
-
Multi-Task Learning with Sequence-Conditioned Transporter Networks
Authors:
Michael H. Lim,
Andy Zeng,
Brian Ichter,
Maryam Bandari,
Erwin Coumans,
Claire Tomlin,
Stefan Schaal,
Aleksandra Faust
Abstract:
Enabling robots to solve multiple manipulation tasks has a wide range of industrial applications. While learning-based approaches enjoy flexibility and generalizability, scaling these approaches to solve such compositional tasks remains a challenge. In this work, we aim to solve multi-task learning through the lens of sequence-conditioning and weighted sampling. First, we propose a new suite of be…
▽ More
Enabling robots to solve multiple manipulation tasks has a wide range of industrial applications. While learning-based approaches enjoy flexibility and generalizability, scaling these approaches to solve such compositional tasks remains a challenge. In this work, we aim to solve multi-task learning through the lens of sequence-conditioning and weighted sampling. First, we propose a new suite of benchmark specifically aimed at compositional tasks, MultiRavens, which allows defining custom task combinations through task modules that are inspired by industrial tasks and exemplify the difficulties in vision-based learning and planning methods. Second, we propose a vision-based end-to-end system architecture, Sequence-Conditioned Transporter Networks, which augments Goal-Conditioned Transporter Networks with sequence-conditioning and weighted sampling and can efficiently learn to solve multi-task long horizon problems. Our analysis suggests that not only the new framework significantly improves pick-and-place performance on novel 10 multi-task benchmark problems, but also the multi-task learning with weighted sampling can vastly improve learning and agent performances on individual tasks.
△ Less
Submitted 15 September, 2021;
originally announced September 2021.
-
Towards a Better Understanding of VR Sickness: Physical Symptom Prediction for VR Contents
Authors:
Hak Gu Kim,
Sangmin Lee,
Seongyeop Kim,
Heoun-taek Lim,
Yong Man Ro
Abstract:
We address the black-box issue of VR sickness assessment (VRSA) by evaluating the level of physical symptoms of VR sickness. For the VR contents inducing the similar VR sickness level, the physical symptoms can vary depending on the characteristics of the contents. Most of existing VRSA methods focused on assessing the overall VR sickness score. To make better understanding of VR sickness, it is r…
▽ More
We address the black-box issue of VR sickness assessment (VRSA) by evaluating the level of physical symptoms of VR sickness. For the VR contents inducing the similar VR sickness level, the physical symptoms can vary depending on the characteristics of the contents. Most of existing VRSA methods focused on assessing the overall VR sickness score. To make better understanding of VR sickness, it is required to predict and provide the level of major symptoms of VR sickness rather than overall degree of VR sickness. In this paper, we predict the degrees of main physical symptoms affecting the overall degree of VR sickness, which are disorientation, nausea, and oculomotor. In addition, we introduce a new large-scale dataset for VRSA including 360 videos with various frame rates, physiological signals, and subjective scores. On VRSA benchmark and our newly collected dataset, our approach shows a potential to not only achieve the highest correlation with subjective scores, but also to better understand which symptoms are the main causes of VR sickness.
△ Less
Submitted 14 April, 2021;
originally announced April 2021.
-
Deep Learning-based High-precision Depth Map Estimation from Missing Viewpoints for 360 Degree Digital Holography
Authors:
Hakdong Kim,
Heonyeong Lim,
Minkyu Jee,
Yurim Lee,
Jisoo Jeong,
Kyudam Choi,
MinSung Yoon,
Cheongwon Kim
Abstract:
In this paper, we propose a novel, convolutional neural network model to extract highly precise depth maps from missing viewpoints, especially well applicable to generate holographic 3D contents. The depth map is an essential element for phase extraction which is required for synthesis of computer-generated hologram (CGH). The proposed model called the HDD Net uses MSE for the better performance o…
▽ More
In this paper, we propose a novel, convolutional neural network model to extract highly precise depth maps from missing viewpoints, especially well applicable to generate holographic 3D contents. The depth map is an essential element for phase extraction which is required for synthesis of computer-generated hologram (CGH). The proposed model called the HDD Net uses MSE for the better performance of depth map estimation as loss function, and utilizes the bilinear interpolation in up sampling layer with the Relu as activation function. We design and prepare a total of 8,192 multi-view images, each resolution of 640 by 360 for the deep learning study. The proposed model estimates depth maps through extracting features, up sampling. For quantitative assessment, we compare the estimated depth maps with the ground truths by using the PSNR, ACC, and RMSE. We also compare the CGH patterns made from estimated depth maps with ones made from ground truths. Furthermore, we demonstrate the experimental results to test the quality of estimated depth maps through directly reconstructing holographic 3D image scenes from the CGHs.
△ Less
Submitted 8 March, 2021;
originally announced March 2021.
-
Deep Learning-based Beam Tracking for Millimeter-wave Communications under Mobility
Authors:
Sun Hong Lim,
Sunwoo Kim,
Byonghyo Shim,
Jun Won Choi
Abstract:
In this paper, we propose a deep learning-based beam tracking method for millimeter-wave (mmWave)communications. Beam tracking is employed for transmitting the known symbols using the sounding beams and tracking time-varying channels to maintain a reliable communication link. When the pose of a user equipment (UE) device varies rapidly, the mmWave channels also tend to vary fast, which hinders sea…
▽ More
In this paper, we propose a deep learning-based beam tracking method for millimeter-wave (mmWave)communications. Beam tracking is employed for transmitting the known symbols using the sounding beams and tracking time-varying channels to maintain a reliable communication link. When the pose of a user equipment (UE) device varies rapidly, the mmWave channels also tend to vary fast, which hinders seamless communication. Thus, models that can capture temporal behavior of mmWave channels caused by the motion of the device are required, to cope with this problem. Accordingly, we employa deep neural network to analyze the temporal structure and patterns underlying in the time-varying channels and the signals acquired by inertial sensors. We propose a model based on long short termmemory (LSTM) that predicts the distribution of the future channel behavior based on a sequence of input signals available at the UE. This channel distribution is used to 1) control the sounding beams adaptively for the future channel state and 2) update the channel estimate through the measurement update step under a sequential Bayesian estimation framework. Our experimental results demonstrate that the proposed method achieves a significant performance gain over the conventional beam tracking methods under various mobility scenarios.
△ Less
Submitted 1 December, 2022; v1 submitted 19 February, 2021;
originally announced February 2021.
-
D-Net: Siamese based Network with Mutual Attention for Volume Alignment
Authors:
Jian-Qing Zheng,
Ngee Han Lim,
Bartlomiej W. Papiez
Abstract:
Alignment of contrast and non-contrast-enhanced imaging is essential for the quantification of changes in several biomedical applications. In particular, the extraction of cartilage shape from contrast-enhanced Computed Tomography (CT) of tibiae requires accurate alignment of the bone, currently performed manually. Existing deep learning-based methods for alignment require a common template or are…
▽ More
Alignment of contrast and non-contrast-enhanced imaging is essential for the quantification of changes in several biomedical applications. In particular, the extraction of cartilage shape from contrast-enhanced Computed Tomography (CT) of tibiae requires accurate alignment of the bone, currently performed manually. Existing deep learning-based methods for alignment require a common template or are limited in rotation range. Therefore, we present a novel network, D-net, to estimate arbitrary rotation and translation between 3D CT scans that additionally does not require a prior standard template. D-net is an extension to the branched Siamese encoder-decoder structure connected by new mutual non-local links, which efficiently capture long-range connections of similar features between two branches. The 3D supervised network is trained and validated using preclinical CT scans of mouse tibiae with and without contrast enhancement in cartilage. The presented results show a significant improvement in the estimation of CT alignment, outperforming the current comparable methods.
△ Less
Submitted 25 January, 2021;
originally announced January 2021.
-
Voronoi Progressive Widening: Efficient Online Solvers for Continuous State, Action, and Observation POMDPs
Authors:
Michael H. Lim,
Claire J. Tomlin,
Zachary N. Sunberg
Abstract:
This paper introduces Voronoi Progressive Widening (VPW), a generalization of Voronoi optimistic optimization (VOO) and action progressive widening to partially observable Markov decision processes (POMDPs). Tree search algorithms can use VPW to effectively handle continuous or hybrid action spaces by efficiently balancing local and global action searching. This paper proposes two VPW-based algori…
▽ More
This paper introduces Voronoi Progressive Widening (VPW), a generalization of Voronoi optimistic optimization (VOO) and action progressive widening to partially observable Markov decision processes (POMDPs). Tree search algorithms can use VPW to effectively handle continuous or hybrid action spaces by efficiently balancing local and global action searching. This paper proposes two VPW-based algorithms and analyzes them from theoretical and simulation perspectives. Voronoi Optimistic Weighted Sparse Sampling (VOWSS) is a theoretical tool that justifies VPW-based online solvers, and it is the first algorithm with global convergence guarantees for continuous state, action, and observation POMDPs. Voronoi Optimistic Monte Carlo Planning with Observation Weighting (VOMCPOW) is a versatile and efficient algorithm that consistently outperforms state-of-the-art POMDP algorithms in several simulation experiments.
△ Less
Submitted 1 April, 2021; v1 submitted 18 December, 2020;
originally announced December 2020.
-
Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning
Authors:
Sungkyun Chang,
Donmoon Lee,
Jeongsoo Park,
Hyungui Lim,
Kyogu Lee,
Karam Ko,
Yoonchang Han
Abstract:
Most of existing audio fingerprinting systems have limitations to be used for high-specific audio retrieval at scale. In this work, we generate a low-dimensional representation from a short unit segment of audio, and couple this fingerprint with a fast maximum inner-product search. To this end, we present a contrastive learning framework that derives from the segment-level search objective. Each u…
▽ More
Most of existing audio fingerprinting systems have limitations to be used for high-specific audio retrieval at scale. In this work, we generate a low-dimensional representation from a short unit segment of audio, and couple this fingerprint with a fast maximum inner-product search. To this end, we present a contrastive learning framework that derives from the segment-level search objective. Each update in training uses a batch consisting of a set of pseudo labels, randomly selected original samples, and their augmented replicas. These replicas can simulate the degrading effects on original audio signals by applying small time offsets and various types of distortions, such as background noise and room/microphone impulse responses. In the segment-level search task, where the conventional audio fingerprinting systems used to fail, our system using 10x smaller storage has shown promising results. Our code and dataset are available at \url{https://mimbres.github.io/neural-audio-fp/}.
△ Less
Submitted 10 February, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments
Authors:
Youngmoon Jung,
Yeunju Choi,
Hyungjun Lim,
Hoirin Kim
Abstract:
Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of virtual assistants. At the same time, there is an increasing requirement for an SV system: it should be robust to short speech segments, especially in noisy and reverberant environments. In this paper, we consider one more important requirement for practical applications: the system sho…
▽ More
Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of virtual assistants. At the same time, there is an increasing requirement for an SV system: it should be robust to short speech segments, especially in noisy and reverberant environments. In this paper, we consider one more important requirement for practical applications: the system should be robust to an audio stream containing long non-speech segments, where a voice activity detection (VAD) is not applied. To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). We present the FPM-based MSA to deal with short speech segments in noisy and reverberant environments. Also, we use the SAS-VAD to increase the robustness to long non-speech segments. To further improve the robustness to acoustic distortions (i.e., noise and reverberation), we apply a masking-based speech enhancement (SE) method. We combine SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner. To the best of our knowledge, this is the first work combining these three models in a deep learning framework. We conduct experiments on Korean indoor (KID) and VoxCeleb datasets, which are corrupted by noise and reverberation. The results show that the proposed method is effective for SV in the challenging conditions and performs better than the baseline i-vector and deep speaker embedding systems.
△ Less
Submitted 6 October, 2020;
originally announced October 2020.
-
MSDPN: Monocular Depth Prediction with Partial Laser Observation using Multi-stage Neural Networks
Authors:
Hyungtae Lim,
Hyeonjae Gil,
Hyun Myung
Abstract:
In this study, a deep-learning-based multi-stage network architecture called Multi-Stage Depth Prediction Network (MSDPN) is proposed to predict a dense depth map using a 2D LiDAR and a monocular camera. Our proposed network consists of a multi-stage encoder-decoder architecture and Cross Stage Feature Aggregation (CSFA). The proposed multi-stage encoder-decoder architecture alleviates the partial…
▽ More
In this study, a deep-learning-based multi-stage network architecture called Multi-Stage Depth Prediction Network (MSDPN) is proposed to predict a dense depth map using a 2D LiDAR and a monocular camera. Our proposed network consists of a multi-stage encoder-decoder architecture and Cross Stage Feature Aggregation (CSFA). The proposed multi-stage encoder-decoder architecture alleviates the partial observation problem caused by the characteristics of a 2D LiDAR, and CSFA prevents the multi-stage network from diluting the features and allows the network to learn the inter-spatial relationship between features better. Previous works use sub-sampled data from the ground truth as an input rather than actual 2D LiDAR data. In contrast, our approach trains the model and conducts experiments with a physically-collected 2D LiDAR dataset. To this end, we acquired our own dataset called KAIST RGBD-scan dataset and validated the effectiveness and the robustness of MSDPN under realistic conditions. As verified experimentally, our network yields promising performance against state-of-the-art methods. Additionally, we analyzed the performance of different input methods and confirmed that the reference depth map is robust in untrained scenarios.
△ Less
Submitted 4 August, 2020;
originally announced August 2020.
-
Sequential Routing Framework: Fully Capsule Network-based Speech Recognition
Authors:
Kyungmin Lee,
Hyunwhan Joe,
Hyeontaek Lim,
Kwangyoun Kim,
Sungsoo Kim,
Chang Woo Han,
Hong-Gee Kim
Abstract:
Capsule networks (CapsNets) have recently gotten attention as a novel neural architecture. This paper presents the sequential routing framework which we believe is the first method to adapt a CapsNet-only structure to sequence-to-sequence recognition. Input sequences are capsulized then sliced by a window size. Each slice is classified to a label at the corresponding time through iterative routing…
▽ More
Capsule networks (CapsNets) have recently gotten attention as a novel neural architecture. This paper presents the sequential routing framework which we believe is the first method to adapt a CapsNet-only structure to sequence-to-sequence recognition. Input sequences are capsulized then sliced by a window size. Each slice is classified to a label at the corresponding time through iterative routing mechanisms. Afterwards, losses are computed by connectionist temporal classification (CTC). During routing, the required number of parameters can be controlled by the window size regardless of the length of sequences by sharing learnable weights across the slices. We additionally propose a sequential dynamic routing algorithm to replace traditional dynamic routing. The proposed technique can minimize decoding speed degradation caused by the routing iterations since it can operate in a non-iterative manner without dropping accuracy. The method achieves a 1.1% lower word error rate at 16.9% on the Wall Street Journal corpus compared to bidirectional long short-term memory-based CTC networks. On the TIMIT corpus, it attains a 0.7% lower phone error rate at 17.5% compared to convolutional neural network-based CTC networks (Zhang et al., 2016).
△ Less
Submitted 1 April, 2021; v1 submitted 22 July, 2020;
originally announced July 2020.
-
Robust Precoder for Mitigating Inter-Symbol and Inter-Carrier Interferences in Coherent Optical FBMC/OQAM
Authors:
Khaled Abdulaziz Alaghbari,
Heng-Siong Lim,
Tawfig Eltaif
Abstract:
In this paper, a new precoder for a coherent optical filter bank multicarrier with offset quadrature amplitude modulation (FBMC/OQAM) system is proposed. The precoder is designed based on an iterative polynomial eigenvalue decomposition (PEVD) algorithm to jointly mitigate the inter-symbol interference (ISI) and inter-carrier interference (ICI). The PEVD algorithm is used to decompose a polynomial…
▽ More
In this paper, a new precoder for a coherent optical filter bank multicarrier with offset quadrature amplitude modulation (FBMC/OQAM) system is proposed. The precoder is designed based on an iterative polynomial eigenvalue decomposition (PEVD) algorithm to jointly mitigate the inter-symbol interference (ISI) and inter-carrier interference (ICI). The PEVD algorithm is used to decompose a polynomial channel matrix into polynomial eigenvectors and eigenvalues matrices, then a precoder is designed based on the decomposed matrices. The precoder acts as a filter applied to each subcarrier and to its adjacent subcarriers. In addition, a new adaptive optimum energy algorithm is proposed to truncate the insignificant precoder filter coefficients produced by the PEVD algorithm and consequently reduce the computational complexity. A mathematical model and algorithm for implementing the precoder are also presented. Robustness of the proposed precoder has been analyzed and compared to existing precoder in optical channel with different dispersion effects. Numerical results demonstrate that the new precoder significantly outperforms the existing precoder in terms of error vector magnitude (EVM), bit error rate (BER), Frobenius norm of the error matrix, and computational complexity.
△ Less
Submitted 20 January, 2020;
originally announced January 2020.
-
Algorithmic decision-making in AVs: Understanding ethical and technical concerns for smart cities
Authors:
Hazel Si Min Lim,
Araz Taeihagh
Abstract:
Autonomous Vehicles (AVs) are increasingly embraced around the world to advance smart mobility and more broadly, smart, and sustainable cities. Algorithms form the basis of decision-making in AVs, allowing them to perform driving tasks autonomously, efficiently, and more safely than human drivers and offering various economic, social, and environmental benefits. However, algorithmic decision-makin…
▽ More
Autonomous Vehicles (AVs) are increasingly embraced around the world to advance smart mobility and more broadly, smart, and sustainable cities. Algorithms form the basis of decision-making in AVs, allowing them to perform driving tasks autonomously, efficiently, and more safely than human drivers and offering various economic, social, and environmental benefits. However, algorithmic decision-making in AVs can also introduce new issues that create new safety risks and perpetuate discrimination. We identify bias, ethics, and perverse incentives as key ethical issues in the AV algorithms' decision-making that can create new safety risks and discriminatory outcomes. Technical issues in the AVs' perception, decision-making and control algorithms, limitations of existing AV testing and verification methods, and cybersecurity vulnerabilities can also undermine the performance of the AV system. This article investigates the ethical and technical concerns surrounding algorithmic decision-making in AVs by exploring how driving decisions can perpetuate discrimination and create new safety risks for the public. We discuss steps taken to address these issues, highlight the existing research gaps and the need to mitigate these issues through the design of AV's algorithms and of policies and regulations to fully realise AVs' benefits for smart and sustainable cities.
△ Less
Submitted 29 October, 2019;
originally announced October 2019.
-
Sparse tree search optimality guarantees in POMDPs with continuous observation spaces
Authors:
Michael H. Lim,
Claire J. Tomlin,
Zachary N. Sunberg
Abstract:
Partially observable Markov decision processes (POMDPs) with continuous state and observation spaces have powerful flexibility for representing real-world decision and control problems but are notoriously difficult to solve. Recent online sampling-based algorithms that use observation likelihood weighting have shown unprecedented effectiveness in domains with continuous observation spaces. However…
▽ More
Partially observable Markov decision processes (POMDPs) with continuous state and observation spaces have powerful flexibility for representing real-world decision and control problems but are notoriously difficult to solve. Recent online sampling-based algorithms that use observation likelihood weighting have shown unprecedented effectiveness in domains with continuous observation spaces. However there has been no formal theoretical justification for this technique. This work offers such a justification, proving that a simplified algorithm, partially observable weighted sparse sampling (POWSS), will estimate Q-values accurately with high probability and can be made to perform arbitrarily near the optimal solution by increasing computational power.
△ Less
Submitted 5 June, 2023; v1 submitted 9 October, 2019;
originally announced October 2019.
-
Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings
Authors:
Myunghun Jung,
Hyungjun Lim,
Jahyun Goo,
Youngmoon Jung,
Hoirin Kim
Abstract:
Acoustic word embeddings --- fixed-dimensional vector representations of arbitrary-length words --- have attracted increasing interest in query-by-example spoken term detection. Recently, on the fact that the orthography of text labels partly reflects the phonetic similarity between the words' pronunciation, a multi-view approach has been introduced that jointly learns acoustic and text embeddings…
▽ More
Acoustic word embeddings --- fixed-dimensional vector representations of arbitrary-length words --- have attracted increasing interest in query-by-example spoken term detection. Recently, on the fact that the orthography of text labels partly reflects the phonetic similarity between the words' pronunciation, a multi-view approach has been introduced that jointly learns acoustic and text embeddings. It showed that it is possible to learn discriminative embeddings by designing the objective which takes text labels as well as word segments. In this paper, we propose a network architecture that expands the multi-view approach by combining the Siamese multi-view encoders with a shared decoder network to maximize the effect of the relationship between acoustic and text embeddings in embedding space. Discriminatively trained with multi-view triplet loss and decoding loss, our proposed approach achieves better performance on acoustic word discrimination task with the WSJ dataset, resulting in 11.1% relative improvement in average precision. We also present experimental results on cross-view word discrimination and word level speech recognition tasks.
△ Less
Submitted 1 October, 2019;
originally announced October 2019.
-
Momentum-Net: Fast and convergent iterative neural network for inverse problems
Authors:
Il Yong Chun,
Zhengyu Huang,
Hongki Lim,
Jeffrey A. Fessler
Abstract:
Iterative neural networks (INN) are rapidly gaining attention for solving inverse problems in imaging, image processing, and computer vision. INNs combine regression NNs and an iterative model-based image reconstruction (MBIR) algorithm, often leading to both good generalization capability and outperforming reconstruction quality over existing MBIR optimization models. This paper proposes the firs…
▽ More
Iterative neural networks (INN) are rapidly gaining attention for solving inverse problems in imaging, image processing, and computer vision. INNs combine regression NNs and an iterative model-based image reconstruction (MBIR) algorithm, often leading to both good generalization capability and outperforming reconstruction quality over existing MBIR optimization models. This paper proposes the first fast and convergent INN architecture, Momentum-Net, by generalizing a block-wise MBIR algorithm that uses momentum and majorizers with regression NNs. For fast MBIR, Momentum-Net uses momentum terms in extrapolation modules, and noniterative MBIR modules at each iteration by using majorizers, where each iteration of Momentum-Net consists of three core modules: image refining, extrapolation, and MBIR. Momentum-Net guarantees convergence to a fixed-point for general differentiable (non)convex MBIR functions (or data-fit terms) and convex feasible sets, under two asymptomatic conditions. To consider data-fit variations across training and testing samples, we also propose a regularization parameter selection scheme based on the "spectral spread" of majorization matrices. Numerical experiments for light-field photography using a focal stack and sparse-view computational tomography demonstrate that, given identical regression NN architectures, Momentum-Net significantly improves MBIR speed and accuracy over several existing INNs; it significantly improves reconstruction quality compared to a state-of-the-art MBIR method in each application.
△ Less
Submitted 20 June, 2020; v1 submitted 26 July, 2019;
originally announced July 2019.
-
Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification
Authors:
Youngmoon Jung,
Younggwan Kim,
Hyungjun Lim,
Yeunju Choi,
Hoirin Kim
Abstract:
In this paper, we propose a new pooling method called spatial pyramid encoding (SPE) to generate speaker embeddings for text-independent speaker verification. We first partition the output feature maps from a deep residual network (ResNet) into increasingly fine sub-regions and extract speaker embeddings from each sub-region through a learnable dictionary encoding layer. These embeddings are conca…
▽ More
In this paper, we propose a new pooling method called spatial pyramid encoding (SPE) to generate speaker embeddings for text-independent speaker verification. We first partition the output feature maps from a deep residual network (ResNet) into increasingly fine sub-regions and extract speaker embeddings from each sub-region through a learnable dictionary encoding layer. These embeddings are concatenated to obtain the final speaker representation. The SPE layer not only generates a fixed-dimensional speaker embedding for a variable-length speech segment, but also aggregates the information of feature distribution from multi-level temporal bins. Furthermore, we apply deep length normalization by augmenting the loss function with ring loss. By applying ring loss, the network gradually learns to normalize the speaker embeddings using model weights themselves while preserving convexity, leading to more robust speaker embeddings. Experiments on the VoxCeleb1 dataset show that the proposed system using the SPE layer and ring loss-based deep length normalization outperforms both i-vector and d-vector baselines.
△ Less
Submitted 19 June, 2019;
originally announced June 2019.
-
Improved low-count quantitative PET reconstruction with an iterative neural network
Authors:
Hongki Lim,
Il Yong Chun,
Yuni K. Dewaraja,
Jeffrey A. Fessler
Abstract:
Image reconstruction in low-count PET is particularly challenging because gammas from natural radioactivity in Lu-based crystals cause high random fractions that lower the measurement signal-to-noise-ratio (SNR). In model-based image reconstruction (MBIR), using more iterations of an unregularized method may increase the noise, so incorporating regularization into the image reconstruction is desir…
▽ More
Image reconstruction in low-count PET is particularly challenging because gammas from natural radioactivity in Lu-based crystals cause high random fractions that lower the measurement signal-to-noise-ratio (SNR). In model-based image reconstruction (MBIR), using more iterations of an unregularized method may increase the noise, so incorporating regularization into the image reconstruction is desirable to control the noise. New regularization methods based on learned convolutional operators are emerging in MBIR. We modify the architecture of an iterative neural network, BCD-Net, for PET MBIR, and demonstrate the efficacy of the trained BCD-Net using XCAT phantom data that simulates the low true coincidence count-rates with high random fractions typical for Y-90 PET patient imaging after Y-90 microsphere radioembolization. Numerical results show that the proposed BCD-Net significantly improves CNR and RMSE of the reconstructed images compared to MBIR methods using non-trained regularizers, total variation (TV) and non-local means (NLM). Moreover, BCD-Net successfully generalizes to test data that differs from the training data. Improvements were also demonstrated for the clinically relevant phantom measurement data where we used training and testing datasets having very different activity distributions and count-levels.
△ Less
Submitted 25 May, 2020; v1 submitted 5 June, 2019;
originally announced June 2019.
-
Scaling Video Analytics on Constrained Edge Nodes
Authors:
Christopher Canel,
Thomas Kim,
Giulio Zhou,
Conglong Li,
Hyeontaek Lim,
David G. Andersen,
Michael Kaminsky,
Subramanya R. Dulloor
Abstract:
As video camera deployments continue to grow, the need to process large volumes of real-time data strains wide area network infrastructure. When per-camera bandwidth is limited, it is infeasible for applications such as traffic monitoring and pedestrian tracking to offload high-quality video streams to a datacenter. This paper presents FilterForward, a new edge-to-cloud system that enables datacen…
▽ More
As video camera deployments continue to grow, the need to process large volumes of real-time data strains wide area network infrastructure. When per-camera bandwidth is limited, it is infeasible for applications such as traffic monitoring and pedestrian tracking to offload high-quality video streams to a datacenter. This paper presents FilterForward, a new edge-to-cloud system that enables datacenter-based applications to process content from thousands of cameras by installing lightweight edge filters that backhaul only relevant video frames. FilterForward introduces fast and expressive per-application microclassifiers that share computation to simultaneously detect dozens of events on computationally constrained edge nodes. Only matching events are transmitted to the cloud. Evaluation on two real-world camera feed datasets shows that FilterForward reduces bandwidth use by an order of magnitude while improving computational efficiency and event detection accuracy for challenging video content.
△ Less
Submitted 24 May, 2019;
originally announced May 2019.
-
I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences
Authors:
Kong Aik Lee,
Ville Hautamaki,
Tomi Kinnunen,
Hitoshi Yamamoto,
Koji Okabe,
Ville Vestman,
Jing Huang,
Guohong Ding,
Hanwu Sun,
Anthony Larcher,
Rohan Kumar Das,
Haizhou Li,
Mickael Rouvier,
Pierre-Michel Bousquet,
Wei Rao,
Qing Wang,
Chunlei Zhang,
Fahimeh Bahmaninezhad,
Hector Delgado,
Jose Patino,
Qiongqiong Wang,
Ling Guo,
Takafumi Koshinaka,
Jiacen Zhang,
Koichi Shinoda
, et al. (21 additional authors not shown)
Abstract:
The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the res…
▽ More
The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progresses, and major paradigm shifts that we have witnessed as an SRE participant in the past decade from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representation to deep speaker embedding, and a switch of research challenge from channel compensation to domain adaptation.
△ Less
Submitted 15 April, 2019;
originally announced April 2019.
-
Learning acoustic word embeddings with phonetically associated triplet network
Authors:
Hyungjun Lim,
Younggwan Kim,
Youngmoon Jung,
Myunghun Jung,
Hoirin Kim
Abstract:
Previous researches on acoustic word embeddings used in query-by-example spoken term detection have shown remarkable performance improvements when using a triplet network. However, the triplet network is trained using only a limited information about acoustic similarity between words. In this paper, we propose a novel architecture, phonetically associated triplet network (PATN), which aims at incr…
▽ More
Previous researches on acoustic word embeddings used in query-by-example spoken term detection have shown remarkable performance improvements when using a triplet network. However, the triplet network is trained using only a limited information about acoustic similarity between words. In this paper, we propose a novel architecture, phonetically associated triplet network (PATN), which aims at increasing discriminative power of acoustic word embeddings by utilizing phonetic information as well as word identity. The proposed model is learned to minimize a combined loss function that was made by introducing a cross entropy loss to the lower layer of LSTM-based triplet network. We observed that the proposed method performs significantly better than the baseline triplet network on a word discrimination task with the WSJ dataset resulting in over 20% relative improvement in recall rate at 1.0 false alarm per hour. Finally, we examined the generalization ability by conducting the out-of-domain test on the RM dataset.
△ Less
Submitted 27 November, 2018; v1 submitted 6 November, 2018;
originally announced November 2018.
-
Chord Generation from Symbolic Melody Using BLSTM Networks
Authors:
Hyungui Lim,
Seungyeon Rhyu,
Kyogu Lee
Abstract:
Generating a chord progression from a monophonic melody is a challenging problem because a chord progression requires a series of layered notes played simultaneously. This paper presents a novel method of generating chord sequences from a symbolic melody using bidirectional long short-term memory (BLSTM) networks trained on a lead sheet database. To this end, a group of feature vectors composed of…
▽ More
Generating a chord progression from a monophonic melody is a challenging problem because a chord progression requires a series of layered notes played simultaneously. This paper presents a novel method of generating chord sequences from a symbolic melody using bidirectional long short-term memory (BLSTM) networks trained on a lead sheet database. To this end, a group of feature vectors composed of 12 semitones is extracted from the notes in each bar of monophonic melodies. In order to ensure that the data shares uniform key and duration characteristics, the key and the time signatures of the vectors are normalized. The BLSTM networks then learn from the data to incorporate the temporal dependencies to produce a chord progression. Both quantitative and qualitative evaluations are conducted by comparing the proposed method with the conventional HMM and DNN-HMM based approaches. Proposed model achieves 23.8% and 11.4% performance increase from the other models, respectively. User studies further confirm that the chord sequences generated by the proposed method are preferred by listeners.
△ Less
Submitted 4 December, 2017;
originally announced December 2017.