
Showing 1–12 of 12 results for author: Yeo, J H

Searching in archive eess.
  1. arXiv:2503.11315  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens

    Authors: Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro

    Abstract: Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: The code and models are available at https://github.com/JeongHun0716/MMS-LLaMA

  2. arXiv:2503.06273  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

    Authors: Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro

    Abstract: We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the st…

    Submitted 8 March, 2025; originally announced March 2025.

  3. arXiv:2409.00986  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language

    Authors: Jeong Hun Yeo, Chae Won Kim, Hyunjun Kim, Hyeongseop Rha, Seunghee Han, Wen-Huang Cheng, Yong Man Ro

    Abstract: Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker adaptive lip reading technologies have advanced by focusing on effectively adapting a lip rea…

    Submitted 1 January, 2025; v1 submitted 2 September, 2024; originally announced September 2024.

    Comments: Code available: https://github.com/JeongHun0716/Personalized-Lip-Reading

  4. arXiv:2402.15151  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

    Authors: Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

    Abstract: In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM),…

    Submitted 13 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: An Erratum was added on the last page of this paper

  5. arXiv:2401.09802  [pdf, other]

    eess.AS cs.CV cs.SD

    Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

    Authors: Minsu Kim, Jeong Hun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro

    Abstract: This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, we propose to use a visual speec…

    Submitted 18 July, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: ACMMM 2024

  6. arXiv:2309.08535  [pdf, other]

    cs.CV cs.AI eess.AS

    Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

    Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

    Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited amount of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the…

    Submitted 12 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  7. arXiv:2309.08531  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

    Authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

    Abstract: In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-s…

    Submitted 15 September, 2023; originally announced September 2023.

  8. arXiv:2308.09311  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS eess.IV

    Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

    Authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro

    Abstract: This paper proposes a novel lip reading framework, especially for low-resource languages, which have not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order…

    Submitted 12 January, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  9. arXiv:2308.07593  [pdf, other]

    cs.CV cs.MM eess.AS eess.IV

    AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

    Authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro

    Abstract: Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different fro…

    Submitted 11 January, 2024; v1 submitted 15 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE Transactions on Multimedia

  10. arXiv:2305.04542  [pdf, other]

    cs.CV eess.AS

    Multi-Temporal Lip-Audio Memory for Visual Speech Recognition

    Authors: Jeong Hun Yeo, Minsu Kim, Yong Man Ro

    Abstract: Visual Speech Recognition (VSR) is a task to predict a sentence or word from lip movements. Some works have been recently presented which use audio signals to supplement visual information. However, existing methods utilize only limited information such as phoneme-level features and soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Mem…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: Presented at ICASSP 2023

  11. arXiv:2009.01382  [pdf]

    eess.SY

    Application of Transformer Impedance Correction Tables in Power Flow Studies

    Authors: Pooria Dehghanian, Ju Hee Yeo, Jessica Wert, Hanyue Li, Komal Shetye, Thomas J. Overbye

    Abstract: Phase Shifting Transformers (PST) are used to control or block certain flows of real power through phase angle regulation across the device. Its functionality is crucial to special situations such as eliminating loop flow through an area and balancing real power flow between parallel paths. Impedance correction tables are used to model that the impedance of phase shifting transformers often varies a…

    Submitted 2 September, 2020; originally announced September 2020.

  12. arXiv:1911.06934  [pdf, other]

    eess.SY

    The Creation and Validation of Load Time Series for Synthetic Electric Power Systems

    Authors: Hanyue Li, Ju Hee Yeo, Ashly L. Bornsheuer, Thomas J. Overbye

    Abstract: Synthetic power systems that imitate functional and statistical characteristics of the actual grid have been developed to promote researchers' access to public system models. Developing time series to represent different operating conditions of these synthetic systems will expand the potential of synthetic power systems applications. This paper proposes a methodology to create synthetic time serie…

    Submitted 15 November, 2019; originally announced November 2019.

    Comments: Submitted to IEEE Transactions on Power Systems