Showing 1–38 of 38 results for author: Huh, J

Searching in archive eess.
  1. arXiv:2505.07851  [pdf, other]

    eess.IV cs.AI cs.CV cs.RO

    Pose Estimation for Intra-cardiac Echocardiography Catheter via AI-Based Anatomical Understanding

    Authors: Jaeyoung Huh, Ankur Kapoor, Young-Ho Kim

    Abstract: Intra-cardiac Echocardiography (ICE) plays a crucial role in Electrophysiology (EP) and Structural Heart Disease (SHD) interventions by providing high-resolution, real-time imaging of cardiac structures. However, existing navigation methods rely on electromagnetic (EM) tracking, which is susceptible to interference and position drift, or require manual adjustments based on operator expertise. To o…

    Submitted 7 May, 2025; originally announced May 2025.

  2. arXiv:2505.05518  [pdf, other]

    eess.IV cs.CV cs.RO

    Guidance for Intra-cardiac Echocardiography Manipulation to Maintain Continuous Therapy Device Tip Visibility

    Authors: Jaeyoung Huh, Ankur Kapoor, Young-Ho Kim

    Abstract: Intra-cardiac Echocardiography (ICE) plays a critical role in Electrophysiology (EP) and Structural Heart Disease (SHD) interventions by providing real-time visualization of intracardiac structures. However, maintaining continuous visibility of the therapy device tip remains a challenge due to frequent adjustments required during manual ICE catheter manipulation. To address this, we propose an AI-…

    Submitted 7 May, 2025; originally announced May 2025.

  3. arXiv:2410.11068  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Character-aware audio-visual subtitling in context

    Authors: Jaesung Huh, Andrew Zisserman

    Abstract: This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holistic solution addresses what is said, when it's said, and who is speaking, providing a more comprehensive and accurate character-aware subtitling for TV shows. Ou…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: ACCV 2024

  4. arXiv:2408.14886  [pdf, other]

    cs.SD cs.AI eess.AS

    The VoxCeleb Speaker Recognition Challenge: A Retrospective

    Authors: Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman

    Abstract: The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provide…

    Submitted 27 August, 2024; originally announced August 2024.

    Comments: TASLP 2024

  5. arXiv:2401.12039  [pdf, other]

    cs.CV cs.SD eess.AS

    Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

    Authors: Bruno Korbar, Jaesung Huh, Andrew Zisserman

    Abstract: The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and the…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted for publication in ICASSP 2024

  6. arXiv:2312.03013  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Breast Ultrasound Report Generation using LangChain

    Authors: Jaeyoung Huh, Hyun Jeong Park, Jong Chul Ye

    Abstract: Breast ultrasound (BUS) is a critical diagnostic tool in the field of breast imaging, aiding in the early detection and characterization of breast abnormalities. Interpreting breast ultrasound images commonly involves creating comprehensive medical reports, containing vital information to promptly assess the patient's condition. However, the ultrasound imaging system necessitates capturing multipl…

    Submitted 4 December, 2023; originally announced December 2023.

  7. arXiv:2307.09006  [pdf, other]

    cs.SD cs.LG eess.AS

    OxfordVGG Submission to the EGO4D AV Transcription Challenge

    Authors: Jaesung Huh, Max Bain, Andrew Zisserman

    Abstract: This report presents the technical details of our submission on the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team. We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two text normalisers which are publicly available. Our final submission obtained 56.0% of the Word Error Rate (W…

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: Technical Report

  8. arXiv:2305.10534  [pdf, other]

    cs.RO eess.SY

    RAMP: Hierarchical Reactive Motion Planning for Manipulation Tasks Using Implicit Signed Distance Functions

    Authors: Vasileios Vasilopoulos, Suveer Garg, Pedro Piacenza, Jinwook Huh, Volkan Isler

    Abstract: We introduce Reactive Action and Motion Planner (RAMP), which combines the strengths of sampling-based and reactive approaches for motion planning. In essence, RAMP is a hierarchical approach where a novel variant of a Model Predictive Path Integral (MPPI) controller is used to generate trajectories which are then followed asynchronously by a local vector field controller. We demonstrate, in the c…

    Submitted 31 July, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023) - 8 pages, 6 figures

  9. arXiv:2303.00747  [pdf, other]

    cs.SD eess.AS

    WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

    Authors: Max Bain, Jaesung Huh, Tengda Han, Andrew Zisserman

    Abstract: Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination & repetition; and prohibits batched transcription due to their sequential nature. Further, timestamps c…

    Submitted 11 July, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

    Comments: Accepted to INTERSPEECH 2023

  10. arXiv:2303.00091  [pdf, other]

    eess.AS cs.AI cs.CL cs.CV cs.SD eess.IV

    Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

    Authors: Jaeyoung Huh, Sangjoon Park, Jeong Eun Lee, Jong Chul Ye

    Abstract: Automatic Speech Recognition (ASR) is a technology that converts spoken words into text, facilitating interaction between humans and machines. One of the most common applications of ASR is Speech-To-Text (STT) technology, which simplifies user workflows by transcribing spoken words into text. In the medical field, STT has the potential to significantly reduce the workload of clinicians who rely on…

    Submitted 27 February, 2023; originally announced March 2023.

  11. arXiv:2302.10248  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

    Authors: Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

    Abstract: This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker re…

    Submitted 6 March, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

  12. arXiv:2302.00646  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Epic-Sounds: A Large-scale Dataset of Actions That Sound

    Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman

    Abstract: We introduce Epic-Sounds, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through groupi…

    Submitted 28 September, 2024; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: 12 pages, 12 figures

  13. arXiv:2211.05910  [pdf, other]

    eess.IV cs.CV

    Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report

    Authors: Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, Jingang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, Yoonsik Choe, Jinwoo Jeong, Sungjei Kim, Maciej Smyl, Tomasz Latkowski, Pawel Kubik, Michal Sokolski, Yujie Ma, Jiahao Chao, Zhou Zhou, Hongfan Gao, Zhengfeng Yang, Zhenbing Zeng, Zhengyang Zhuge, Chenghua Li , et al. (71 additional authors not shown)

    Abstract: Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: arXiv admin note: text overlap with arXiv:2105.07825, arXiv:2105.08826, arXiv:2211.04470, arXiv:2211.03885, arXiv:2211.05256

  14. arXiv:2211.00437  [pdf, other]

    eess.AS cs.SD

    Disentangled representation learning for multilingual speaker recognition

    Authors: Kihyun Nam, Youkyum Kim, Jaesung Huh, Hee Soo Heo, Jee-weon Jung, Joon Son Chung

    Abstract: The goal of this paper is to learn robust speaker representation for bilingual speaking scenario. The majority of the world's population speak at least two languages; however, most speaker recognition systems fail to recognise the same speaker when speaking in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse t…

    Submitted 6 June, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Interspeech 2023

  15. arXiv:2210.14682  [pdf, other]

    cs.SD cs.AI eess.AS

    In search of strong embedding extractors for speaker diarisation

    Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

    Abstract: Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: 5pages, 1 figure, 2 tables, submitted to ICASSP

  16. arXiv:2202.08262  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Phase Aberration Robust Beamformer for Planewave US Using Self-Supervised Learning

    Authors: Shujaat Khan, Jaeyoung Huh, Jong Chul Ye

    Abstract: Ultrasound (US) is widely used for clinical imaging applications thanks to its real-time and non-invasive nature. However, its lesion detectability is often limited in many applications due to the phase aberration artefact caused by variations in the speed of sound (SoS) within body parts. To address this, here we propose a novel self-supervised 3D CNN that enables phase aberration robust plane-wa…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: 10 pages, 12 figures, submitted to IEEE-TMI

  17. arXiv:2201.04583  [pdf, other]

    cs.SD eess.AS

    VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge

    Authors: Andrew Brown, Jaesung Huh, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

    Abstract: The third instalment of the VoxCeleb Speaker Recognition Challenge was held in conjunction with Interspeech 2021. The aim of this challenge was to assess how well current speaker recognition technology is able to diarise and recognise speakers in unconstrained or `in the wild' data. The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from Yo…

    Submitted 16 November, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2012.06867

  18. arXiv:2112.02896  [pdf, other]

    eess.IV cs.CV cs.LG

    Tunable Image Quality Control of 3-D Ultrasound using Switchable CycleGAN

    Authors: Jaeyoung Huh, Shujaat Khan, Sungjin Choi, Dongkuk Shin, Eun Sun Lee, Jong Chul Ye

    Abstract: In contrast to 2-D ultrasound (US) for uniaxial plane imaging, a 3-D US imaging system can visualize a volume along three axial planes. This allows for a full view of the anatomy, which is useful for gynecological (GYN) and obstetrical (OB) applications. Unfortunately, the 3-D US has an inherent limitation in resolution compared to the 2-D US. In the case of 3-D US with a 3-D mechanical probe, for…

    Submitted 6 December, 2021; originally announced December 2021.

  19. arXiv:2111.01024  [pdf, other]

    cs.CV cs.SD eess.AS

    With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

    Authors: Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen

    Abstract: In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action s…

    Submitted 1 November, 2021; originally announced November 2021.

    Comments: Accepted at BMVC 2021

  20. arXiv:2103.09022  [pdf, other]

    eess.IV cs.CV cs.LG

    Missing Cone Artifacts Removal in ODT using Unsupervised Deep Learning in Projection Domain

    Authors: Hyungjin Chung, Jaeyoung Huh, Geon Kim, Yong Keun Park, Jong Chul Ye

    Abstract: Optical diffraction tomography (ODT) produces three dimensional distribution of refractive index (RI) by measuring scattering fields at various angles. Although the distribution of RI index is highly informative, due to the missing cone problem stemming from the limited-angle acquisition of holograms, reconstructions have very poor resolution along axial direction compared to the horizontal imagin…

    Submitted 18 July, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

    Comments: This will appear in IEEE Trans. on Computational Imaging

  21. arXiv:2012.06867  [pdf, other]

    cs.SD cs.LG eess.AS

    VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge

    Authors: Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman

    Abstract: We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020. The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition and diarisation dataset from YouTube videos together…

    Submitted 12 December, 2020; originally announced December 2020.

  22. arXiv:2011.14885  [pdf, ps, other]

    cs.SD eess.AS

    Look who's not talking

    Authors: Youngki Kwon, Hee Soo Heo, Jaesung Huh, Bong-Jin Lee, Joon Son Chung

    Abstract: The objective of this work is speaker diarisation of speech recordings 'in the wild'. The ability to determine speech segments is a crucial part of diarisation systems, accounting for a large proportion of errors. In this paper, we present a simple but effective solution for speech activity detection based on the speaker embeddings. In particular, we discover that the norm of the speaker embedding…

    Submitted 30 November, 2020; originally announced November 2020.

    Comments: SLT 2021

  23. arXiv:2010.15716  [pdf, other]

    cs.SD eess.AS

    Playing a Part: Speaker Verification at the Movies

    Authors: Andrew Brown, Jaesung Huh, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

    Abstract: The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where often actors intentionally disguise their voice to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies con…

    Submitted 11 February, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

    Comments: The first three authors contributed equally to this work

  24. arXiv:2009.14153  [pdf, other]

    eess.AS cs.SD

    Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020

    Authors: Hee Soo Heo, Bong-Jin Lee, Jaesung Huh, Joon Son Chung

    Abstract: This report describes our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020. We perform a careful analysis of speaker recognition models based on the popular ResNet architecture, and train a number of variants using a range of loss functions. Our results show significant improvements over most existing works without the use of model ensemble or post-processing.…

    Submitted 29 September, 2020; originally announced September 2020.

  25. arXiv:2008.13646  [pdf, other]

    eess.IV cs.CV cs.LG stat.ML

    Switchable Deep Beamformer

    Authors: Shujaat Khan, Jaeyoung Huh, Jong Chul Ye

    Abstract: Recent proposals of deep beamformers using deep neural networks have attracted significant attention as computational efficient alternatives to adaptive and compressive beamformers. Moreover, deep beamformers are versatile in that image post-processing algorithms can be combined with the beamforming. Unfortunately, in the current technology, a separate beamformer should be trained and stored for e…

    Submitted 4 September, 2020; v1 submitted 31 August, 2020; originally announced August 2020.

  26. arXiv:2007.12085  [pdf, other]

    cs.SD cs.LG eess.AS

    Augmentation adversarial training for self-supervised speaker recognition

    Authors: Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung

    Abstract: The goal of this work is to train robust speaker recognition models without speaker labels. Recent works on unsupervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to…

    Submitted 30 October, 2020; v1 submitted 23 July, 2020; originally announced July 2020.

    Comments: Workshop on Self-Supervised Learning for Speech and Audio Processing, NeurIPS

  27. arXiv:2007.05205  [pdf, other]

    eess.IV cs.CV cs.LG stat.ML

    OT-driven Multi-Domain Unsupervised Ultrasound Image Artifact Removal using a Single CNN

    Authors: Jaeyoung Huh, Shujaat Khan, Jong Chul Ye

    Abstract: Ultrasound imaging (US) often suffers from distinct image artifacts from various sources. Classic approaches for solving these problems are usually model-based iterative approaches that have been developed specifically for each type of artifact, which are often computationally intensive. Recently, deep learning approaches have been proposed as computationally efficient and high performance alterna…

    Submitted 10 July, 2020; originally announced July 2020.

  28. arXiv:2007.01216  [pdf, other]

    cs.SD cs.CV eess.AS eess.IV

    Spot the conversation: speaker diarisation in the wild

    Authors: Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

    Abstract: The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creat…

    Submitted 15 August, 2021; v1 submitted 2 July, 2020; originally announced July 2020.

    Comments: The dataset will be available for download from http://www.robots.ox.ac.uk/~vgg/data/voxceleb/voxconverse.html . The development set will be released in July 2020, and the test set will be released in October 2020

  29. arXiv:2006.14773  [pdf, other]

    cs.CV cs.LG eess.IV stat.ML

    Pushing the Limit of Unsupervised Learning for Ultrasound Image Artifact Removal

    Authors: Shujaat Khan, Jaeyoung Huh, Jong Chul Ye

    Abstract: Ultrasound (US) imaging is a fast and non-invasive imaging modality which is widely used for real-time clinical imaging applications without concerning about radiation hazard. Unfortunately, it often suffers from poor visual quality from various origins, such as speckle noises, blurring, multi-line acquisition (MLA), limited RF channels, small number of view angles for the case of plane wave imagi…

    Submitted 25 June, 2020; originally announced June 2020.

  30. arXiv:2005.08776  [pdf, other]

    eess.AS cs.SD

    Metric Learning for Keyword Spotting

    Authors: Jaesung Huh, Minjae Lee, Heesoo Heo, Seongkyu Mun, Joon Son Chung

    Abstract: The goal of this work is to train effective representations for keyword spotting via metric learning. Most existing works address keyword spotting as a closed-set classification problem, where both target and non-target keywords are predefined. Therefore, prevailing classifier-based keyword spotting systems perform poorly on non-target sounds which are unseen during the training stage, causing hig…

    Submitted 18 May, 2020; originally announced May 2020.

  31. In defence of metric learning for speaker recognition

    Authors: Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, Icksang Han

    Abstract: The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper…

    Submitted 24 April, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

    Comments: The code can be found at https://github.com/clovaai/voxceleb_trainer

  32. arXiv:2002.03559  [pdf, other]

    cs.SD cs.LG eess.AS

    Modeling Musical Onset Probabilities via Neural Distribution Learning

    Authors: Jaesung Huh, Egil Martinsson, Adrian Kim, Jung-Woo Ha

    Abstract: Musical onset detection can be formulated as a time-to-event (TTE) or time-since-event (TSE) prediction task by defining music as a sequence of onset events. Here we propose a novel method to model the probability of onsets by introducing a sequential density prediction model. The proposed model estimates TTE & TSE distributions from mel-spectrograms using convolutional neural networks (CNNs) as a…

    Submitted 10 February, 2020; originally announced February 2020.

    Comments: 2 pages, 2 figures, 2 tables

  33. arXiv:2001.09236  [pdf, other]

    eess.SY

    Abstraction-based Synthesis for Stochastic Systems with Omega-Regular Objectives

    Authors: Maxence Dutreix, Jeongmin Huh, Samuel Coogan

    Abstract: This paper studies the synthesis of controllers for discrete-time, continuous state stochastic systems subject to omega-regular specifications using finite-state abstractions. We present a synthesis algorithm for minimizing or maximizing the probability that a discrete-time stochastic system with finite number of modes satisfies an omega-regular property. Our approach uses a finite-state abstracti…

    Submitted 21 September, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: A theoretical mistake has been identified in the first version of the manuscript (Fact 1, which states that minimizing policies are memoryless in the product BMDP) and has been corrected in the second version. Other content related to this mistake has been modified (e.g. removal of Algorithm 2 and 4)

  34. arXiv:1911.02411  [pdf, other]

    cs.SD eess.AS

    The sound of my voice: speaker representation loss for target voice separation

    Authors: Seongkyu Mun, Soyeon Choe, Jaesung Huh, Joon Son Chung

    Abstract: Content and style representations have been widely studied in the field of style transfer. In this paper, we propose a new loss function using speaker content representation for audio source separation, and we call it speaker representation loss. The objective is to extract the target speaker voice from the noisy input and also remove it from the residual components. Compared to the conventional s…

    Submitted 27 February, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

    Comments: To appear in ICASSP 2020. The first two authors contributed equally to this work

  35. arXiv:1910.11238  [pdf, other]

    cs.SD cs.LG eess.AS

    Delving into VoxCeleb: environment invariant speaker recognition

    Authors: Joon Son Chung, Jaesung Huh, Seongkyu Mun

    Abstract: Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets. There has been a plethora of work in search for more powerful architectures or loss functions suitable for the task, but these works do not consider what information is learnt by the models, apart from being able to predict the giv…

    Submitted 3 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

  36. arXiv:1907.10257  [pdf, other]

    eess.IV cs.CV cs.LG

    Adaptive and Compressive Beamforming Using Deep Learning for Medical Ultrasound

    Authors: Shujaat Khan, Jaeyoung Huh, Jong Chul Ye

    Abstract: In ultrasound (US) imaging, various types of adaptive beamforming techniques have been investigated to improve the resolution and contrast-to-noise ratio of the delay and sum (DAS) beamformers. Unfortunately, the performance of these adaptive beamforming approaches degrade when the underlying model is not sufficiently accurate and the number of channels decreases. To address this problem, here we…

    Submitted 23 February, 2020; v1 submitted 24 July, 2019; originally announced July 2019.

    Comments: This is a significantly extended version of the original paper in arXiv:1901.01706. This paper is accepted for IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control

  37. arXiv:1904.02843  [pdf, other]

    eess.IV cs.CV cs.LG stat.ML

    Deep Learning-based Universal Beamformer for Ultrasound Imaging

    Authors: Shujaat Khan, Jaeyoung Huh, Jong Chul Ye

    Abstract: In ultrasound (US) imaging, individual channel RF measurements are back-propagated and accumulated to form an image after applying specific delays. While this time reversal is usually implemented using a hardware- or software-based delay-and-sum (DAS) beamformer, the performance of DAS decreases rapidly in situations where data acquisition is not ideal. Herein, for the first time, we demonstrate t…

    Submitted 15 July, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

    Comments: Accepted for MICCAI 2019. arXiv admin note: substantial text overlap with arXiv:1901.01706

  38. arXiv:1903.03107   

    cs.SD cs.LG eess.AS stat.ML

    Phase-aware Speech Enhancement with Deep Complex U-Net

    Authors: Hyeong-Seok Choi, Jang-Hyun Kim, Jaesung Huh, Adrian Kim, Jung-Woo Ha, Kyogu Lee

    Abstract: Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-…

    Submitted 2 April, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

    Comments: Significant error was found in data processing step, therefore will be retracted from International Conference on Learning Representations (ICLR) 2019. It is not recommended to read current version