Showing 1–14 of 14 results for author: Park, S J

Searching in archive eess.

  1. arXiv:2503.11315  [pdf, ps, other]

    cs.CV cs.MM cs.SD eess.AS

    MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens

    Authors: Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro

    Abstract: Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes…

    Submitted 5 June, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

    Comments: Accepted at Findings of ACL 2025. The code and models are available at https://github.com/JeongHun0716/MMS-LLaMA

  2. arXiv:2412.18603  [pdf, other]

    cs.CL cs.SD eess.AS

    Long-Form Speech Generation with Spoken Language Models

    Authors: Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

    Abstract: We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to…

    Submitted 24 December, 2024; originally announced December 2024.

  3. arXiv:2401.09802  [pdf, other]

    eess.AS cs.CV cs.SD

    Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

    Authors: Minsu Kim, Jeong Hun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro

    Abstract: This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, we propose to use a visual speec…

    Submitted 18 July, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: ACMMM 2024

  4. arXiv:2312.09576  [pdf, other]

    eess.IV cs.CV

    SegRap2023: A Benchmark of Organs-at-Risk and Gross Tumor Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma

    Authors: Xiangde Luo, Jia Fu, Yunxin Zhong, Shuolin Liu, Bing Han, Mehdi Astaraki, Simone Bendazzoli, Iuliana Toma-Dasu, Yiwen Ye, Ziyang Chen, Yong Xia, Yanzhou Su, Jin Ye, Junjun He, Zhaohu Xing, Hongqiu Wang, Lei Zhu, Kaixiang Yang, Xin Fang, Zhiwei Wang, Chan Woong Lee, Sang Joon Park, Jaehee Chun, Constantin Ulrich, Klaus H. Maier-Hein, et al. (17 additional authors not shown)

    Abstract: Radiation therapy is a primary and effective NasoPharyngeal Carcinoma (NPC) treatment strategy. The precise delineation of Gross Tumor Volumes (GTVs) and Organs-At-Risk (OARs) is crucial in radiation treatment, directly impacting patient prognosis. Previously, the delineation of GTVs and OARs was performed by experienced radiation oncologists. Recently, deep learning has achieved promising results…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: A challenge report of SegRap2023 (organized in conjunction with MICCAI2023)

  5. arXiv:2312.02512  [pdf, other]

    cs.CV cs.AI cs.MM eess.AS

    AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

    Authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

    Abstract: This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast…

    Submitted 26 March, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024. Code & Demo: https://choijeongsoo.github.io/av2av

  6. arXiv:2310.14946  [pdf, other]

    cs.MM cs.SD eess.AS

    Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model

    Authors: Joanna Hong, Se Jin Park, Yong Man Ro

    Abstract: We present a novel approach to multilingual audio-visual speech recognition tasks by introducing a single model on a multilingual dataset. Motivated by a human cognitive system where humans can intuitively distinguish different languages without any conscious effort or guidance, we propose a model that can capture which language is given as an input speech by distinguishing the inherent similariti…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Findings

  7. arXiv:2310.05934  [pdf, other]

    cs.CV cs.AI cs.MM eess.IV

    DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

    Authors: Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro

    Abstract: Speech-driven 3D facial animation has gained significant attention for its ability to create realistic and expressive facial animations in 3D space based on speech. Learning-based methods have shown promising progress in achieving accurate facial motion synchronized with speech. However, one-to-many nature of speech-to-3D facial synthesis has not been fully explored: while the lip accurately synch…

    Submitted 23 August, 2023; originally announced October 2023.

  8. arXiv:2306.16003  [pdf, other]

    cs.GR cs.CV cs.SD eess.AS

    Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models

    Authors: Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro

    Abstract: In this paper, we present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. Consequently, we can easily generate face videos that articulate the provided textual sentences, eliminating the necessity of recording speech for each inference, as required in the audio-driven model. To this end, we propose to embed the input text into t…

    Submitted 18 January, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: ICASSP 2024

  9. arXiv:2305.19556  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS eess.IV

    Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

    Authors: Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro

    Abstract: Talking face generation is the challenging task of synthesizing a natural and realistic face that requires accurate synchronization with a given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies upon the phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temp…

    Submitted 1 April, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2024

  10. arXiv:2305.06813  [pdf, other]

    eess.IV cs.CV

    Generation of Structurally Realistic Retinal Fundus Images with Diffusion Models

    Authors: Sojung Go, Younghoon Ji, Sang Jun Park, Soochahn Lee

    Abstract: We introduce a new technique for generating retinal fundus images that have anatomically accurate vascular structures, using diffusion models. We generate artery/vein masks to create the vascular structure, which we then condition to produce retinal fundus images. The proposed method can generate high-quality images with more realistic vascular structures and can create a diverse range of images b…

    Submitted 11 May, 2023; originally announced May 2023.

    Comments: 9 pages, 6 figures

  11. arXiv:2211.00924  [pdf, other]

    cs.CV cs.AI eess.IV

    SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

    Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

    Abstract: The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize fine details of the lips varying at th…

    Submitted 2 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted at AAAI 2022 (Oral)

  12. arXiv:2204.01265  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

    Authors: Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

    Abstract: In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory network that stores source (i.e., visual) and target (i.e., audio) modal representations, where source modal representation is what we are given, and target modal representations are what we want to obtain from the memor…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Published at ICCV 2021

  13. arXiv:2201.06735  [pdf]

    eess.SP

    AI Augmented Digital Metal Component

    Authors: Eunhyeok Seo, Hyokyung Sung, Hayeol Kim, Taekyeong Kim, Sangeun Park, Min Sik Lee, Seung Ki Moon, Jung Gi Kim, Hayoung Chung, Seong-Kyum Choi, Ji-hun Yu, Kyung Tae Kim, Seong Jin Park, Namhun Kim, Im Doo Jung

    Abstract: The aim of this work is to propose a new paradigm that imparts intelligence to metal parts with the fusion of metal additive manufacturing and artificial intelligence (AI). Our digital metal part classifies the status with real time data processing with convolutional neural network (CNN). The training data for the CNN is collected from a strain gauge embedded in metal parts by laser powder bed fus…

    Submitted 17 January, 2022; originally announced January 2022.

    Comments: 46 pages

  14. arXiv:2008.03616  [pdf, ps, other]

    eess.AS cs.LG eess.SP

    Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification

    Authors: Amber Afshan, Jinxi Guo, Soo Jin Park, Vijay Ravi, Alan McCree, Abeer Alwan

    Abstract: The effects of speaking-style variability on automatic speaker verification were investigated using the UCLA Speaker Variability database which comprises multiple speaking styles per speaker. An x-vector/PLDA (probabilistic linear discriminant analysis) system was trained with the SRE and Switchboard databases with standard augmentation techniques and evaluated with utterances from the UCLA databa…

    Submitted 8 August, 2020; originally announced August 2020.

    Comments: Accepted to Interspeech 2020