Search | arXiv e-print repository

Fourier-Modulated Implicit Neural Representation for Multispectral Satellite Image Compression

Authors: Woojin Cho, Steve Andreas Immanuel, Junhyuk Heo, Darongsae Kwon

Abstract: Multispectral satellite images play a vital role in agriculture, fisheries, and environmental monitoring. However, their high dimensionality, large data volumes, and diverse spatial resolutions across multiple channels pose significant challenges for data compression and analysis. This paper presents ImpliSat, a unified framework specifically designed to address these challenges through efficient… ▽ More Multispectral satellite images play a vital role in agriculture, fisheries, and environmental monitoring. However, their high dimensionality, large data volumes, and diverse spatial resolutions across multiple channels pose significant challenges for data compression and analysis. This paper presents ImpliSat, a unified framework specifically designed to address these challenges through efficient compression and reconstruction of multispectral satellite data. ImpliSat leverages Implicit Neural Representations (INR) to model satellite images as continuous functions over coordinate space, capturing fine spatial details across varying spatial resolutions. Furthermore, we introduce a Fourier modulation algorithm that dynamically adjusts to the spectral and spatial characteristics of each band, ensuring optimal compression while preserving critical image details. △ Less

Submitted 11 June, 2025; v1 submitted 1 June, 2025; originally announced June 2025.

Comments: Accepted to IGARSS 2025 (Oral)

arXiv:2505.16798 [pdf, ps, other]

SEED: Speaker Embedding Enhancement Diffusion Model

Authors: KiHyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung

Abstract: A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker e… ▽ More A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings extracted from clean and noisy speech, respectively, via forward process of a diffusion model, and then reconstructs them to clean embeddings in the reverse process. While inferencing, all embeddings are regenerated via diffusion process. Our method needs neither speaker label nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environment mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios. We publish our code here https://github.com/kaistmm/seed-pytorch △ Less

Submitted 22 May, 2025; originally announced May 2025.

Comments: Accepted to Interspeech 2025. The official code can be found at https://github.com/kaistmm/seed-pytorch

arXiv:2503.03785 [pdf, other]

Tackling Few-Shot Segmentation in Remote Sensing via Inpainting Diffusion Model

Authors: Steve Andreas Immanuel, Woojin Cho, Junhyuk Heo, Darongsae Kwon

Abstract: Limited data is a common problem in remote sensing due to the high cost of obtaining annotated samples. In the few-shot segmentation task, models are typically trained on base classes with abundant annotations and later adapted to novel classes with limited examples. However, this often necessitates specialized model architectures or complex training strategies. Instead, we propose a simple approa… ▽ More Limited data is a common problem in remote sensing due to the high cost of obtaining annotated samples. In the few-shot segmentation task, models are typically trained on base classes with abundant annotations and later adapted to novel classes with limited examples. However, this often necessitates specialized model architectures or complex training strategies. Instead, we propose a simple approach that leverages diffusion models to generate diverse variations of novel-class objects within a given scene, conditioned by the limited examples of the novel classes. By framing the problem as an image inpainting task, we synthesize plausible instances of novel classes under various environments, effectively increasing the number of samples for the novel classes and mitigating overfitting. The generated samples are then assessed using a cosine similarity metric to ensure semantic consistency with the novel classes. Additionally, we employ Segment Anything Model (SAM) to segment the generated samples and obtain precise annotations. By using high-quality synthetic data, we can directly fine-tune off-the-shelf segmentation models. Experimental results demonstrate that our method significantly enhances segmentation performance in low-data regimes, highlighting its potential for real-world remote sensing applications. △ Less

Submitted 4 March, 2025; originally announced March 2025.

Comments: Accepted to ICLRW 2025 (Oral)

arXiv:2502.08179 [pdf, other]

Can TDD Be Employed in LEO SatCom Systems? Challenges and Potential Approaches

Authors: Hyunwoo Lee, Ian P. Roberts, Jehyun Heo, Joohyun Son, Hanwoong Kim, Yunseo Lee, Daesik Hong

Abstract: Frequency-division duplexing (FDD) remains the de facto standard in modern low Earth orbit (LEO) satellite communication (SatCom) systems, such as SpaceX's Starlink, OneWeb, and Amazon's Project Kuiper. While time-division duplexing (TDD) is often regarded as superior in today's terrestrial networks, its viability in future LEO SatCom systems remains unclear. This article details how the long prop… ▽ More Frequency-division duplexing (FDD) remains the de facto standard in modern low Earth orbit (LEO) satellite communication (SatCom) systems, such as SpaceX's Starlink, OneWeb, and Amazon's Project Kuiper. While time-division duplexing (TDD) is often regarded as superior in today's terrestrial networks, its viability in future LEO SatCom systems remains unclear. This article details how the long propagation delays and high orbital velocities exhibited by LEO SatCom systems impedes the adoption of TDD, due to challenges involving the frame structure and synchronization. We then present potential approaches to overcome these challenges, which vary in terms of resource efficiency and operational/device complexity and thus would likely be application-specific. We conclude by assessing the performance of these proposed approaches, putting into perspective the tradeoff between complexity and performance gains over FDD. Overall, this article aims to motivate future investigation into the prospects of TDD in LEO SatCom systems and solutions to enable such, with the goal of enhancing future systems and unifying them with terrestrial networks. △ Less

Submitted 12 February, 2025; originally announced February 2025.

arXiv:2412.00124 [pdf, other]

Auto-Encoded Supervision for Perceptual Image Super-Resolution

Authors: MinKyu Lee, Sangeek Hyun, Woojin Jun, Jae-Pil Heo

Abstract: This work tackles the fidelity objective in the perceptual super-resolution~(SR). Specifically, we address the shortcomings of pixel-level $L_\text{p}$ loss ($\mathcal{L}_\text{pix}$) in the GAN-based SR framework. Since $L_\text{pix}$ is known to have a trade-off relationship against perceptual quality, prior methods often multiply a small scale factor or utilize low-pass filters. However, this w… ▽ More This work tackles the fidelity objective in the perceptual super-resolution~(SR). Specifically, we address the shortcomings of pixel-level $L_\text{p}$ loss ($\mathcal{L}_\text{pix}$) in the GAN-based SR framework. Since $L_\text{pix}$ is known to have a trade-off relationship against perceptual quality, prior methods often multiply a small scale factor or utilize low-pass filters. However, this work shows that these circumventions fail to address the fundamental factor that induces blurring. Accordingly, we focus on two points: 1) precisely discriminating the subcomponent of $L_\text{pix}$ that contributes to blurring, and 2) only guiding based on the factor that is free from this trade-off relationship. We show that they can be achieved in a surprisingly simple manner, with an Auto-Encoder (AE) pretrained with $L_\text{pix}$. Accordingly, we propose the Auto-Encoded Supervision for Optimal Penalization loss ($L_\text{AESOP}$), a novel loss function that measures distance in the AE space, instead of the raw pixel space. Note that the AE space indicates the space after the decoder, not the bottleneck. By simply substituting $L_\text{pix}$ with $L_\text{AESOP}$, we can provide effective reconstruction guidance without compromising perceptual quality. Designed for simplicity, our method enables easy integration into existing SR frameworks. Experimental results verify that AESOP can lead to favorable results in the perceptual SR task. △ Less

Submitted 11 April, 2025; v1 submitted 28 November, 2024; originally announced December 2024.

Comments: Codes are available at https://github.com/2minkyulee/AESOP-Auto-Encoded-Supervision-for-Perceptual-Image-Super-Resolution

arXiv:2406.07103 [pdf, other]

MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms

Authors: Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu

Abstract: In speaker verification systems, the utilization of short utterances presents a persistent challenge, leading to performance degradation primarily due to insufficient phonetic information to characterize the speakers. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable duration utterances using raw… ▽ More In speaker verification systems, the utilization of short utterances presents a persistent challenge, leading to performance degradation primarily due to insufficient phonetic information to characterize the speakers. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable duration utterances using raw waveforms. The MR-RawNet extracts time-frequency representations from raw waveforms via a multi-resolution feature extractor that optimally adjusts both temporal and spectral resolutions simultaneously. Furthermore, we apply a multi-resolution attention block that focuses on diverse and extensive temporal contexts, ensuring robustness against changes in utterance length. The experimental results, conducted on VoxCeleb1 dataset, demonstrate that the MR-RawNet exhibits superior performance in handling utterances of variable duration compared to other raw waveform-based systems. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: 5 pages, accepted by Interspeech 2024

arXiv:2405.05426 [pdf, other]

ATLS: Automated Trailer Loading for Surface Vessels

Authors: Amer Abughaida, Meet Gandhi, Jun Heo, Vaishnav Tadiparthi, Yosuke Sakamoto, Joohyun Woo, Sangjae Bae

Abstract: Automated docking technologies of marine boats have been enlightened by an increasing number of literature. This paper contributes to the literature by proposing a mathematical framework that automates "trailer loading" in the presence of wind disturbances, which is unexplored despite its importance to boat owners. The comprehensive pipeline of localization, system identification, and trajectory o… ▽ More Automated docking technologies of marine boats have been enlightened by an increasing number of literature. This paper contributes to the literature by proposing a mathematical framework that automates "trailer loading" in the presence of wind disturbances, which is unexplored despite its importance to boat owners. The comprehensive pipeline of localization, system identification, and trajectory optimization is structured, followed by several techniques to improve performance reliability. The performance of the proposed method was demonstrated with a commercial pontoon boat in Michigan, in 2023, securing a success rate of 80\% in the presence of perception errors and wind disturbance. This result indicates the strong potential of the proposed pipeline, effectively accommodating the wind effect. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: To be presented at IEEE Intelligent Vehicles Symposium (IV 2024)

arXiv:2309.08320 [pdf, other]

Diff-SV: A Unified Hierarchical Framework for Noise-Robust Speaker Verification Using Score-Based Diffusion Probabilistic Models

Authors: Ju-ho Kim, Jungwoo Heo, Hyun-seo Shin, Chan-yeong Lim, Ha-Jin Yu

Abstract: Background noise considerably reduces the accuracy and reliability of speaker verification (SV) systems. These challenges can be addressed using a speech enhancement system as a front-end module. Recently, diffusion probabilistic models (DPMs) have exhibited remarkable noise-compensation capabilities in the speech enhancement domain. Building on this success, we propose Diff-SV, a noise-robust SV… ▽ More Background noise considerably reduces the accuracy and reliability of speaker verification (SV) systems. These challenges can be addressed using a speech enhancement system as a front-end module. Recently, diffusion probabilistic models (DPMs) have exhibited remarkable noise-compensation capabilities in the speech enhancement domain. Building on this success, we propose Diff-SV, a noise-robust SV framework that leverages DPM. Diff-SV unifies a DPM-based speech enhancement system with a speaker embedding extractor, and yields a discriminative and noise-tolerable speaker representation through a hierarchical structure. The proposed model was evaluated under both in-domain and out-of-domain noisy conditions using the VoxCeleb1 test set, an external noise source, and the VOiCES corpus. The obtained experimental results demonstrate that Diff-SV achieves state-of-the-art performance, outperforming recently proposed noise-robust SV systems. △ Less

Submitted 13 December, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: 5 pages, 2 figures, accepted for ICASSP 2024

arXiv:2309.08208 [pdf, other]

HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods

Authors: Hyun-seo Shin, Jungwoo Heo, Ju-ho Kim, Chan-yeong Lim, Wonbin Kim, Ha-Jin Yu

Abstract: Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish between spoofed and bona-fide utterances, might exist either locally or globally in the input features. To capture these, the Conformer, which consists of Transformers and CNN, possesses a suitable structure. However, since… ▽ More Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish between spoofed and bona-fide utterances, might exist either locally or globally in the input features. To capture these, the Conformer, which consists of Transformers and CNN, possesses a suitable structure. However, since the Conformer was designed for sequence-to-sequence tasks, its direct application to ADD tasks may be sub-optimal. To tackle this limitation, we propose HM-Conformer by adopting two components: (1) Hierarchical pooling method progressively reducing the sequence length to eliminate duplicated information (2) Multi-level classification token aggregation method utilizing classification tokens to gather information from different blocks. Owing to these components, HM-Conformer can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them. In experimental results on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: Submitted to 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

arXiv:2309.04549 [pdf, other]

doi 10.1109/SEC54971.2022.00036

Poster: Making Edge-assisted LiDAR Perceptions Robust to Lossy Point Cloud Compression

Authors: Jin Heo, Gregorie Phillips, Per-Erik Brodin, Ada Gavrilovska

Abstract: Real-time light detection and ranging (LiDAR) perceptions, e.g., 3D object detection and simultaneous localization and mapping are computationally intensive to mobile devices of limited resources and often offloaded on the edge. Offloading LiDAR perceptions requires compressing the raw sensor data, and lossy compression is used for efficiently reducing the data volume. Lossy compression degrades t… ▽ More Real-time light detection and ranging (LiDAR) perceptions, e.g., 3D object detection and simultaneous localization and mapping are computationally intensive to mobile devices of limited resources and often offloaded on the edge. Offloading LiDAR perceptions requires compressing the raw sensor data, and lossy compression is used for efficiently reducing the data volume. Lossy compression degrades the quality of LiDAR point clouds, and the perception performance is decreased consequently. In this work, we present an interpolation algorithm improving the quality of a LiDAR point cloud to mitigate the perception performance loss due to lossy compression. The algorithm targets the range image (RI) representation of a point cloud and interpolates points at the RI based on depth gradients. Compared to existing image interpolation algorithms, our algorithm shows a better qualitative result when the point cloud is reconstructed from the interpolated RI. With the preliminary results, we also describe the next steps of the current work. △ Less

Submitted 8 September, 2023; originally announced September 2023.

Comments: extended abstract of 2 pages, 2 figures, 1 table

arXiv:2307.10628 [pdf, other]

PAS: Partial Additive Speech Data Augmentation Method for Noise Robust Speaker Verification

Authors: Wonbin Kim, Hyun-seo Shin, Ju-ho Kim, Jungwoo Heo, Chan-yeong Lim, Ha-Jin Yu

Abstract: Background noise reduces speech intelligibility and quality, making speaker verification (SV) in noisy environments a challenging task. To improve the noise robustness of SV systems, additive noise data augmentation method has been commonly used. In this paper, we propose a new additive noise method, partial additive speech (PAS), which aims to train SV systems to be less affected by noisy environ… ▽ More Background noise reduces speech intelligibility and quality, making speaker verification (SV) in noisy environments a challenging task. To improve the noise robustness of SV systems, additive noise data augmentation method has been commonly used. In this paper, we propose a new additive noise method, partial additive speech (PAS), which aims to train SV systems to be less affected by noisy environments. The experimental results demonstrate that PAS outperforms traditional additive noise in terms of equal error rates (EER), with relative improvements of 4.64% and 5.01% observed in SE-ResNet34 and ECAPA-TDNN. We also show the effectiveness of proposed method by analyzing attention modules and visualizing speaker embeddings. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Comments: 5 pages, 2 figures, 1 table, accepted to CKAIA2023 as a conference paper

arXiv:2305.17394 [pdf, other]

One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification

Authors: Jungwoo Heo, Chan-yeong Lim, Ju-ho Kim, Hyun-seo Shin, Ha-Jin Yu

Abstract: The application of speech self-supervised learning (SSL) models has achieved remarkable performance in speaker verification (SV). However, there is a computational cost hurdle in employing them, which makes development and deployment difficult. Several studies have simply compressed SSL models through knowledge distillation (KD) without considering the target task. Consequently, these methods coul… ▽ More The application of speech self-supervised learning (SSL) models has achieved remarkable performance in speaker verification (SV). However, there is a computational cost hurdle in employing them, which makes development and deployment difficult. Several studies have simply compressed SSL models through knowledge distillation (KD) without considering the target task. Consequently, these methods could not extract SV-tailored features. This paper suggests One-Step Knowledge Distillation and Fine-Tuning (OS-KDFT), which incorporates KD and fine-tuning (FT). We optimize a student model for SV during KD training to avert the distillation of inappropriate information for the SV. OS-KDFT could downsize Wav2Vec 2.0 based ECAPA-TDNN size by approximately 76.2%, and reduce the SSL model's inference time by 79% while presenting an EER of 0.98%. The proposed OS-KDFT is validated across VoxCeleb1 and VoxCeleb2 datasets and W2V2 and HuBERT SSL models. Experiments are available on our GitHub. △ Less

Submitted 7 June, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: ISCA INTERSPEECH 2023

arXiv:2211.02227 [pdf, other]

Integrated Parameter-Efficient Tuning for General-Purpose Audio Models

Authors: Ju-ho Kim, Jungwoo Heo, Hyun-seo Shin, Chan-yeong Lim, Ha-Jin Yu

Abstract: The advent of hyper-scale and general-purpose pre-trained models is shifting the paradigm of building task-specific models for target tasks. In the field of audio research, task-agnostic pre-trained models with high transferability and adaptability have achieved state-of-the-art performances through fine-tuning for downstream tasks. Nevertheless, re-training all the parameters of these massive mod… ▽ More The advent of hyper-scale and general-purpose pre-trained models is shifting the paradigm of building task-specific models for target tasks. In the field of audio research, task-agnostic pre-trained models with high transferability and adaptability have achieved state-of-the-art performances through fine-tuning for downstream tasks. Nevertheless, re-training all the parameters of these massive models entails an enormous amount of time and cost, along with a huge carbon footprint. To overcome these limitations, the present study explores and applies efficient transfer learning methods in the audio domain. We also propose an integrated parameter-efficient tuning (IPET) framework by aggregating the embedding prompt (a prompt-based learning approach), and the adapter (an effective transfer learning method). We demonstrate the efficacy of the proposed framework using two backbone pre-trained audio models with different characteristics: the audio spectrogram transformer and wav2vec 2.0. The proposed IPET framework exhibits remarkable performance compared to fine-tuning method with fewer trainable parameters in four downstream tasks: sound event classification, music genre classification, keyword spotting, and speaker verification. Furthermore, the authors identify and analyze the shortcomings of the IPET framework, providing lessons and research directions for parameter efficient tuning in the audio domain. △ Less

Submitted 1 March, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: 5 pages, 3 figures

arXiv:2211.01599 [pdf, other]

Convolution channel separation and frequency sub-bands aggregation for music genre classification

Authors: Jungwoo Heo, Hyun-seo Shin, Ju-ho Kim, Chan-yeong Lim, Ha-Jin Yu

Abstract: In music, short-term features such as pitch and tempo constitute long-term semantic features such as melody and narrative. A music genre classification (MGC) system should be able to analyze these features. In this research, we propose a novel framework that can extract and aggregate both short- and long-term features hierarchically. Our framework is based on ECAPA-TDNN, where all the layers that… ▽ More In music, short-term features such as pitch and tempo constitute long-term semantic features such as melody and narrative. A music genre classification (MGC) system should be able to analyze these features. In this research, we propose a novel framework that can extract and aggregate both short- and long-term features hierarchically. Our framework is based on ECAPA-TDNN, where all the layers that extract short-term features are affected by the layers that extract long-term features because of the back-propagation training. To prevent the distortion of short-term features, we devised the convolution channel separation technique that separates short-term features from long-term feature extraction paths. To extract more diverse features from our framework, we incorporated the frequency sub-bands aggregation method, which divides the input spectrogram along frequency bandwidths and processes each segment. We evaluated our framework using the Melon Playlist dataset which is a large-scale dataset containing 600 times more data than GTZAN which is a widely used dataset in MGC studies. As the result, our framework achieved 70.4% accuracy, which was improved by 16.9% compared to a conventional framework. △ Less

Submitted 3 November, 2022; originally announced November 2022.

arXiv:2206.13807 [pdf, other]

doi 10.21437/Interspeech.2022-602

Two Methods for Spoofing-Aware Speaker Verification: Multi-Layer Perceptron Score Fusion Model and Integrated Embedding Projector

Authors: Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin

Abstract: The use of deep neural networks (DNN) has dramatically elevated the performance of automatic speaker verification (ASV) over the last decade. However, ASV systems can be easily neutralized by spoofing attacks. Therefore, the Spoofing-Aware Speaker Verification (SASV) challenge is designed and held to promote development of systems that can perform ASV considering spoofing attacks by integrating AS… ▽ More The use of deep neural networks (DNN) has dramatically elevated the performance of automatic speaker verification (ASV) over the last decade. However, ASV systems can be easily neutralized by spoofing attacks. Therefore, the Spoofing-Aware Speaker Verification (SASV) challenge is designed and held to promote development of systems that can perform ASV considering spoofing attacks by integrating ASV and spoofing countermeasure (CM) systems. In this paper, we propose two back-end systems: multi-layer perceptron score fusion model (MSFM) and integrated embedding projector (IEP). The MSFM, score fusion back-end system, derived SASV score utilizing ASV and CM scores and embeddings. On the other hand,IEP combines ASV and CM embeddings into SASV embedding and calculates final SASV score based on the cosine similarity. We effectively integrated ASV and CM systems through proposed MSFM and IEP and achieved the SASV equal error rates 0.56%, 1.32% on the official evaluation trials of the SASV 2022 challenge. △ Less

Submitted 28 June, 2022; originally announced June 2022.

Comments: 5 pages, 4 figures, 5 tables, accepted to 2022 Interspeech as a conference paper

Journal ref: Proc. Interspeech 2022

arXiv:2206.13044 [pdf, other]

Extended U-Net for Speaker Verification in Noisy Environments

Authors: Ju-ho Kim, Jungwoo Heo, Hye-jin Shim, Ha-Jin Yu

Abstract: Background noise is a well-known factor that deteriorates the accuracy and reliability of speaker verification (SV) systems by blurring speech intelligibility. Various studies have used separate pretrained enhancement models as the front-end module of the SV system in noisy environments, and these methods effectively remove noises. However, the denoising process of independent enhancement models n… ▽ More Background noise is a well-known factor that deteriorates the accuracy and reliability of speaker verification (SV) systems by blurring speech intelligibility. Various studies have used separate pretrained enhancement models as the front-end module of the SV system in noisy environments, and these methods effectively remove noises. However, the denoising process of independent enhancement models not tailored to the SV task can also distort the speaker information included in utterances. We argue that the enhancement network and speaker embedding extractor should be fully jointly trained for SV tasks under noisy conditions to alleviate this issue. Therefore, we proposed a U-Net-based integrated framework that simultaneously optimizes speaker identification and feature enhancement losses. Moreover, we analyzed the structural limitations of using U-Net directly for noise SV tasks and further proposed Extended U-Net to reduce these drawbacks. We evaluated the models on the noise-synthesized VoxCeleb1 test set and VOiCES development set recorded in various noisy scenarios. The experimental results demonstrate that the U-Net-based fully joint training framework is more effective than the baseline, and the extended U-Net exhibited state-of-the-art performance versus the recently proposed compensation systems. △ Less

Submitted 27 June, 2022; originally announced June 2022.

Comments: 5 pages, 2 figures, 4 tables, accepted to 2022 Interspeech as a conference paper

arXiv:2112.12343 [pdf, other]

Graph attentive feature aggregation for text-independent speaker verification

Authors: Hye-jin Shim, Jungwoo Heo, Jae-han Park, Ga-hui Lee, Ha-Jin Yu

Abstract: The objective of this paper is to combine multiple frame-level features into a single utterance-level representation considering pairwise relationship. For this purpose, we propose a novel graph attentive feature aggregation module by interpreting each frame-level feature as a node of a graph. The inter-relationship between all possible pairs of features, typically exploited indirectly, can be dir… ▽ More The objective of this paper is to combine multiple frame-level features into a single utterance-level representation considering pairwise relationship. For this purpose, we propose a novel graph attentive feature aggregation module by interpreting each frame-level feature as a node of a graph. The inter-relationship between all possible pairs of features, typically exploited indirectly, can be directly modeled using a graph. The module comprises a graph attention layer and a graph pooling layer followed by a readout operation. The graph attention layer first models the non-Euclidean data manifold between different nodes. Then, the graph pooling layer discards less informative nodes considering the significance of the nodes. Finally, the readout operation combines the remaining nodes into a single representation. We employ two recent systems, SE-ResNet and RawNet2, with different input features and architectures and demonstrate that the proposed feature aggregation module consistently shows a relative improvement over 10%, compared to the baseline. △ Less

Submitted 22 December, 2021; originally announced December 2021.

Comments: 5 pages, 1 figure, 6 tables, submitted to ICASSP 2022

arXiv:2112.07935 [pdf, other]

RawNeXt: Speaker verification system for variable-duration utterances with deep layer aggregation and extended dynamic scaling policies

Authors: Ju-ho Kim, Hye-jin Shim, Jungwoo Heo, Ha-Jin Yu

Abstract: Despite achieving satisfactory performance in speaker verification using deep neural networks, variable-duration utterances remain a challenge that threatens the robustness of systems. To deal with this issue, we propose a speaker verification system called RawNeXt that can handle input raw waveforms of arbitrary length by employing the following two components: (1) A deep layer aggregation strate… ▽ More Despite achieving satisfactory performance in speaker verification using deep neural networks, variable-duration utterances remain a challenge that threatens the robustness of systems. To deal with this issue, we propose a speaker verification system called RawNeXt that can handle input raw waveforms of arbitrary length by employing the following two components: (1) A deep layer aggregation strategy enhances speaker information by iteratively and hierarchically aggregating features of various time scales and spectral channels output from blocks. (2) An extended dynamic scaling policy flexibly processes features according to the length of the utterance by selectively merging the activations of different resolution branches in each block. Owing to these two components, our proposed model can extract speaker embeddings rich in time-spectral information and operate dynamically on length variations. Experimental results on the VoxCeleb1 test set consisting of various duration utterances demonstrate that RawNeXt achieves state-of-the-art performance compared to the recently proposed systems. Our code and trained model weights are available at https://github.com/wngh1187/RawNeXt. △ Less

Submitted 27 June, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Comments: 5 pages, 2 figures, 4 tables, accepted to 2022 ICASSP as a conference paper

Showing 1–18 of 18 results for author: Heo, J