Showing 1–50 of 65 results for author: Sung, J

Searching in archive cs.
  1. arXiv:2505.00552  [pdf, other]

    cs.IR cs.LG

    Graph Spectral Filtering with Chebyshev Interpolation for Recommendation

    Authors: Chanwoo Kim, Jinkyu Sung, Yebonn Han, Joonseok Lee

    Abstract: Graph convolutional networks have recently gained prominence in collaborative filtering (CF) for recommendations. However, we identify potential bottlenecks in two foundational components. First, the embedding layer leads to a latent space with limited capacity, overlooking locally observed but potentially valuable preference patterns. Also, the widely-used neighborhood aggregation is limited in i…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: Accepted by SIGIR 2025; 11 pages, 9 figures, 5 tables

  2. arXiv:2504.04045  [pdf, other]

    cs.CV cs.AI cs.LG

    A Survey of Pathology Foundation Model: Progress and Future Directions

    Authors: Conghao Xiong, Hao Chen, Joseph J. Y. Sung

    Abstract: Computational pathology, analyzing whole slide images for automated cancer diagnosis, relies on the multiple instance learning framework where performance heavily depends on the feature extractor and aggregator. Recent Pathology Foundation Models (PFMs), pretrained on large-scale histopathology data, have significantly enhanced capabilities of extractors and aggregators but lack systematic analysi…

    Submitted 4 April, 2025; originally announced April 2025.

  3. arXiv:2503.07390  [pdf, other]

    cs.CV

    PersonaBooth: Personalized Text-to-Motion Generation

    Authors: Boeun Kim, Hea In Jeong, JungHoon Sung, Yihua Cheng, Jeongmin Lee, Ju Yong Chang, Sang-Il Choi, Younggeun Choi, Saim Shin, Jungho Kim, Hyung Jin Chang

    Abstract: This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions using several basic motions containing Persona. To support this novel task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose a multi-modal finetuning method of a pretrained motio…

    Submitted 21 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

  4. arXiv:2503.05116  [pdf, other]

    cs.AR

    Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather

    Authors: Changmin Shin, Jaeyong Song, Hongsun Jang, Dogeun Kim, Jun Sung, Taehee Kwon, Jae Hyung Ju, Frank Liu, Yeonkyu Choi, Jinho Lee

    Abstract: Graph processing requires irregular, fine-grained random access patterns incompatible with contemporary off-chip memory architecture, leading to inefficient data access. This inefficiency makes graph processing an extremely memory-bound application. Because of this, existing graph processing accelerators typically employ a graph tiling-based or processing-in-memory (PIM) approach to relieve the me…

    Submitted 9 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: HPCA 2025

  5. arXiv:2502.06139  [pdf, other]

    cs.CL

    LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLMs

    Authors: Sumin An, Junyoung Sung, Wonpyo Park, Chanjun Park, Paul Hongsuck Seo

    Abstract: While large language models (LLMs) excel in generating coherent and contextually rich outputs, their capacity to efficiently handle long-form contexts is limited by fixed-length position embeddings. Additionally, the computational cost of processing long sequences increases quadratically, making it challenging to extend context length. To address these challenges, we propose Long-form Context Inje…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: Accepted to NAACL 2025 Main

  6. arXiv:2502.06086  [pdf, other]

    cs.CL

    Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type

    Authors: Seokwon Song, Taehyun Lee, Jaewoo Ahn, Jae Hyuk Sung, Gunhee Kim

    Abstract: Conceptual combination is a cognitive process that merges basic concepts, enabling the creation of complex expressions. During this process, the properties of combination (e.g., the whiteness of a peeled apple) can be inherited from basic concepts, newly emerge, or be canceled. However, previous studies have evaluated a limited set of properties and have not examined the generative process. To add…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: NAACL 2025; the dataset and experimental code are available at https://github.com/seokwon99/CCPT.git

  7. Color Flow Imaging Microscopy Improves Identification of Stress Sources of Protein Aggregates in Biopharmaceuticals

    Authors: Michaela Cohrs, Shiwoo Koak, Yejin Lee, Yu Jin Sung, Wesley De Neve, Hristo L. Svilenov, Utku Ozbulak

    Abstract: Protein-based therapeutics play a pivotal role in modern medicine targeting various diseases. Despite their therapeutic importance, these products can aggregate and form subvisible particles (SvPs), which can compromise their efficacy and trigger immunological responses, emphasizing the critical need for robust monitoring techniques. Flow Imaging Microscopy (FIM) has been a significant advancement…

    Submitted 26 January, 2025; originally announced January 2025.

    Comments: Accepted for publication in MICCAI 2024 Workshop on Medical Optical Imaging and Virtual Microscopy Image Analysis (MOVI)

  8. arXiv:2409.02788  [pdf, other]

    cs.NI cs.ET

    Enhancing 5G Performance: Reducing Service Time and Research Directions for 6G Standards

    Authors: Laura Landon, Vipindev Adat Vasudevan, Jaeweon Kim, Junmo Sung, Jeffery Tony Masters, Muriel Médard

    Abstract: This paper presents several methods for minimizing packet service time in networks using 5G and beyond. We propose leveraging network coding alongside Hybrid Automatic Repeat reQuest (HARQ) to reduce service time as well as optimizing Modulation and Coding Scheme (MCS) selection based on the service time. Our network coding approach includes a method to increase the number of packets in flight, ad…

    Submitted 4 September, 2024; originally announced September 2024.

  9. arXiv:2407.07666  [pdf]

    cs.CL cs.AI

    A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability

    Authors: Ting Fang Tan, Kabilan Elangovan, Jasmine Ong, Nigam Shah, Joseph Sung, Tien Yin Wong, Lan Xue, Nan Liu, Haibo Wang, Chang Fu Kuo, Simon Chesterman, Zee Kin Yeong, Daniel SW Ting

    Abstract: A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond traditional accuracy and quantitative metrics is needed. We propose 5 key aspects for evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models…

    Submitted 10 July, 2024; originally announced July 2024.

  10. arXiv:2406.09696  [pdf, other]

    eess.IV cs.CV

    MoME: Mixture of Multimodal Experts for Cancer Survival Prediction

    Authors: Conghao Xiong, Hao Chen, Hao Zheng, Dong Wei, Yefeng Zheng, Joseph J. Y. Sung, Irwin King

    Abstract: Survival analysis, as a challenging task, requires integrating Whole Slide Images (WSIs) and genomic data for comprehensive decision-making. There are two main challenges in this task: significant heterogeneity and complex inter- and intra-modal interactions between the two modalities. Previous approaches utilize co-attention methods, which fuse features from both modalities only once after separa…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 8 + 1/2 pages, early accepted to MICCAI2024

  11. PADTHAI-MM: Principles-based Approach for Designing Trustworthy, Human-centered AI using MAST Methodology

    Authors: Myke C. Cohen, Nayoung Kim, Yang Ba, Anna Pan, Shawaiz Bhatti, Pouria Salehi, James Sung, Erik Blasch, Michelle V. Mancenido, Erin K. Chiou

    Abstract: Despite an extensive body of literature on trust in technology, designing trustworthy AI systems for high-stakes decision domains remains a significant challenge, further compounded by the lack of actionable design and evaluation tools. The Multisource AI Scorecard Table (MAST) was designed to bridge this gap by offering a systematic, tradecraft-centered approach to evaluating AI-enabled decision…

    Submitted 22 January, 2025; v1 submitted 24 January, 2024; originally announced January 2024.

  12. arXiv:2311.18040  [pdf, other]

    cs.CY

    Evaluating Trustworthiness of AI-Enabled Decision Support Systems: Validation of the Multisource AI Scorecard Table (MAST)

    Authors: Pouria Salehi, Yang Ba, Nayoung Kim, Ahmadreza Mosallanezhad, Anna Pan, Myke C. Cohen, Yixuan Wang, Jieqiong Zhao, Shawaiz Bhatti, James Sung, Erik Blasch, Michelle V. Mancenido, Erin K. Chiou

    Abstract: The Multisource AI Scorecard Table (MAST) is a checklist tool based on analytic tradecraft standards to inform the design and evaluation of trustworthy AI systems. In this study, we evaluate whether MAST is associated with people's trust perceptions in AI-enabled decision support systems (AI-DSSs). Evaluating trust in AI-DSSs poses challenges to researchers and practitioners. These challenges incl…

    Submitted 29 November, 2023; originally announced November 2023.

  13. arXiv:2311.02107  [pdf]

    cs.LG cs.AI cs.CY

    Generative Artificial Intelligence in Healthcare: Ethical Considerations and Assessment Checklist

    Authors: Yilin Ning, Salinelat Teixayavong, Yuqing Shang, Julian Savulescu, Vaishaanth Nagaraj, Di Miao, Mayli Mertens, Daniel Shu Wei Ting, Jasmine Chiat Ling Ong, Mingxuan Liu, Jiuwen Cao, Michael Dunn, Roger Vaughan, Marcus Eng Hock Ong, Joseph Jao-Yiu Sung, Eric J Topol, Nan Liu

    Abstract: The widespread use of ChatGPT and other emerging technology powered by generative artificial intelligence (GenAI) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare, but ethical discussions are yet to translate into operationalisable solutions. Furthermore, ongoing ethical discussions often neglect other types of GenAI that have been use…

    Submitted 23 February, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

  14. arXiv:2310.14804  [pdf, other]

    cs.CV cs.AI cs.CL

    Large Language Models can Share Images, Too!

    Authors: Young-Jun Lee, Dokyong Lee, Joo Won Sung, Jonghwan Hyeon, Ho-Jin Choi

    Abstract: This paper explores the image-sharing capability of Large Language Models (LLMs), such as GPT-4 and LLaMA 2, in a zero-shot setting. To facilitate a comprehensive evaluation of LLMs, we introduce the PhotoChat++ dataset, which includes enriched annotations (i.e., intent, triggering sentence, image description, and salient information). Furthermore, we present the gradient-free and extensible Decid…

    Submitted 4 July, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: ACL 2024 Findings; Code is available in https://github.com/passing2961/DribeR

  15. arXiv:2309.00208  [pdf, other]

    cs.CL cs.AI

    Large Language Models for Semantic Monitoring of Corporate Disclosures: A Case Study on Korea's Top 50 KOSPI Companies

    Authors: Junwon Sung, Woojin Heo, Yunkyung Byun, Youngsam Kim

    Abstract: In the rapidly advancing domain of artificial intelligence, state-of-the-art language models such as OpenAI's GPT-3.5-turbo and GPT-4 offer unprecedented opportunities for automating complex tasks. This research paper delves into the capabilities of these models for semantically analyzing corporate disclosures in the Korean context, specifically for timely disclosure. The study focuses on the top…

    Submitted 31 August, 2023; originally announced September 2023.

  16. arXiv:2307.07130  [pdf, other]

    stat.AP cs.IR

    Digital Health Discussion Through Articles Published Until the Year 2021: A Digital Topic Modeling Approach

    Authors: Junhyoun Sung, Hyungsook Kim

    Abstract: The digital health industry has grown in popularity since the 2010s, but there has been limited analysis of the topics discussed in the field across academic disciplines. This study aims to analyze the research trends of digital health-related articles published on the Web of Science until 2021, in order to understand the concentration, scope, and characteristics of the research. 15,950 digital he…

    Submitted 18 September, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

    Comments: 13 pages, 5 figures

  17. arXiv:2303.05780  [pdf, other]

    cs.CV cs.AI

    TAKT: Target-Aware Knowledge Transfer for Whole Slide Image Classification

    Authors: Conghao Xiong, Yi Lin, Hao Chen, Hao Zheng, Dong Wei, Yefeng Zheng, Joseph J. Y. Sung, Irwin King

    Abstract: Transferring knowledge from a source domain to a target domain can be crucial for whole slide image classification, since the number of samples in a dataset is often limited due to high annotation costs. However, domain shift and task discrepancy between datasets can hinder effective knowledge transfer. In this paper, we propose a Target-Aware Knowledge Transfer framework, employing a teacher-stud…

    Submitted 11 July, 2024; v1 submitted 10 March, 2023; originally announced March 2023.

    Comments: Accepted by MICCAI2024

  18. arXiv:2301.08125  [pdf, other]

    cs.CV cs.AI

    Diagnose Like a Pathologist: Transformer-Enabled Hierarchical Attention-Guided Multiple Instance Learning for Whole Slide Image Classification

    Authors: Conghao Xiong, Hao Chen, Joseph J. Y. Sung, Irwin King

    Abstract: Multiple Instance Learning (MIL) and transformers are increasingly popular in histopathology Whole Slide Image (WSI) classification. However, unlike human pathologists who selectively observe specific regions of histopathology tissues under different magnifications, most methods do not incorporate multiple resolutions of the WSIs, hierarchically and attentively, thereby leading to a loss of focus…

    Submitted 16 July, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

    Comments: Accepted to IJCAI2023

  19. arXiv:2301.01449  [pdf, other]

    cs.CV

    Building Coverage Estimation with Low-resolution Remote Sensing Imagery

    Authors: Enci Liu, Chenlin Meng, Matthew Kolodner, Eun Jee Sung, Sihang Chen, Marshall Burke, David Lobell, Stefano Ermon

    Abstract: Building coverage statistics provide crucial insights into the urbanization, infrastructure, and poverty level of a region, facilitating efforts towards alleviating poverty, building sustainable cities, and allocating infrastructure investments and public service provision. Global mapping of buildings has been made more efficient with the incorporation of deep learning models into the pipeline. Ho…

    Submitted 4 January, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

  20. arXiv:2211.16307  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Controllable speech synthesis by learning discrete phoneme-level prosodic representations

    Authors: Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, June Sig Sung, Aimilios Chalamandaris, Pirros Tsiakoulis, Paris Mastorocostas

    Abstract: In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autore…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: Final published version available at: Speech Communication. arXiv admin note: substantial text overlap with arXiv:2111.10168

  21. arXiv:2211.01327  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

    Authors: Konstantinos Klapsas, Karolos Nikitaras, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures at the task of predicting phoneme level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics t…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  22. arXiv:2211.00523  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

    Authors: Karolos Nikitaras, Konstantinos Klapsas, Nikolaos Ellinas, Georgia Maniati, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate in the correspond…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  23. arXiv:2211.00342  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

    Authors: Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with re…

    Submitted 7 May, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Proceedings of ICASSP 2023

  24. arXiv:2210.17264

    cs.SD cs.CL cs.LG eess.AS

    Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

    Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Georgia Maniati, Panos Kakoulidis, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC)…

    Submitted 27 February, 2024; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: Fundamental changes to the model described and experimental procedure

  25. arXiv:2206.10878  [pdf, other]

    cs.CV

    Feature Re-calibration based Multiple Instance Learning for Whole Slide Image Classification

    Authors: Philip Chikontwe, Soo Jeong Nam, Heounjeong Go, Meejeong Kim, Hyun Jung Sung, Sang Hyun Park

    Abstract: Whole slide image (WSI) classification is a fundamental task for the diagnosis and treatment of diseases; however, curation of accurate labels is time-consuming and limits the application of fully-supervised methods. To address this, multiple instance learning (MIL) is a popular method that poses classification as a weakly supervised learning task with slide-level labels only. While current MIL method…

    Submitted 21 July, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

    Comments: MICCAI 2022

  26. arXiv:2204.05070  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Fine-grained Noise Control for Multispeaker Speech Synthesis

    Authors: Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre from any residual factors, such as recording conditions and background noise. This paper pr…

    Submitted 27 October, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  27. Karaoker: Alignment-free singing voice synthesis with speech training data

    Authors: Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthes…

    Submitted 31 August, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  28. arXiv:2204.03421  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Self-supervised learning for robust voice cloning

    Authors: Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are…

    Submitted 2 November, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  29. arXiv:2204.03040  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

    Authors: Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a publ…

    Submitted 24 August, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  30. arXiv:2203.14416  [pdf, other]

    eess.AS cs.LG cs.SD

    Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge

    Authors: Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung

    Abstract: Text-to-Speech (TTS) services that run on edge devices have many advantages compared to cloud TTS, e.g., latency and privacy issues. However, neural vocoders with a low complexity and small model footprint inevitably generate annoying sounds. This study proposes a Bunched LPCNet2, an improved LPCNet architecture that provides highly efficient performance in high-quality for cloud servers and in a…

    Submitted 30 June, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  31. arXiv:2111.10177  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis

    Authors: Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of ICASSP 2021

  32. arXiv:2111.10173  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Word-Level Style Control for Expressive, Non-attentive Speech Synthesis

    Authors: Konstantinos Klapsas, Nikolaos Ellinas, June Sig Sung, Hyoungmin Park, Spyros Raptis

    Abstract: This paper presents an expressive speech synthesis architecture for modeling and controlling the speaking style at a word level. It attempts to learn word-level stylistic and prosodic representations of the speech data, with the aid of two encoders. The first one models style by finding a combination of style tokens for each word given the acoustic features, and the second outputs a word-level seq…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of SPECOM 2021

  33. arXiv:2111.10168  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

    Authors: Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control ra…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of SPECOM 2021

  34. arXiv:2111.09146  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

    Authors: Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, Georgia Maniati, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: In this paper, a text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data and which provides prosody control at the phoneme level. Dataset augmentation and additional prosody manipulation based on traditional DSP algorithms are also investigated. The neural TTS model is fine-…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of 11th ISCA Speech Synthesis Workshop (SSW 11)

  35. arXiv:2111.09075  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

    Authors: Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologi…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of INTERSPEECH 2021

  36. arXiv:2111.09052  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

    Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

    Abstract: This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by usin…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of INTERSPEECH 2020

  37. arXiv:2111.07072  [pdf, other]

    cs.CV

    Factorial Convolution Neural Networks

    Authors: Jaemo Sung, Eun-Sung Jung

    Abstract: In recent years, GoogleNet has garnered substantial attention as one of the base convolutional neural networks (CNNs) to extract visual features for object detection. However, it experiences challenges of contaminated deep features when concatenating elements with different properties. Also, since GoogleNet is not an entirely lightweight CNN, it still has many execution overheads to apply to a res…

    Submitted 13 November, 2021; originally announced November 2021.

  38. arXiv:2104.12845  [pdf, other]

    astro-ph.EP astro-ph.IM cs.LG

    Multi-Output Random Forest Regression to Emulate the Earliest Stages of Planet Formation

    Authors: Kevin Hoffman, Jae Yoon Sung, André Zazzera

    Abstract: In the current paradigm of planet formation research, it is believed that the first step to forming massive bodies (such as asteroids and planets) requires that small interstellar dust grains floating through space collide with each other and grow to larger sizes. The initial formation of these pebbles is governed by an integro-differential equation known as the Smoluchowski coagulation equation,…

    Submitted 26 April, 2021; originally announced April 2021.

  39. arXiv:2103.14776  [pdf, other]

    eess.AS cs.LG cs.SD

    Scalable and Efficient Neural Speech Coding: A Hybrid Design

    Authors: Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim

    Abstract: We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so the coding artifact…

    Submitted 27 November, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP), 2021 (Accepted for publication)
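
    The abstract states that the NWC treats quantization as a trainable module. One widely used way to make scalar quantization trainable is a softmax relaxation over a learnable codebook: each sample becomes a softmax-weighted average of codewords, which tends to nearest-codeword rounding as the hardness parameter alpha grows. The sketch below illustrates only that general idea, under the assumption that it is representative; it does not claim to reproduce this paper's quantizer, and the codebook values, alpha, and function names are hypothetical.

```python
import numpy as np


def soft_quantize(x, codebook, alpha):
    """Differentiable (softmax-weighted) quantization of each sample in x.

    Each sample is replaced by a convex combination of codewords, weighted by a
    softmax over negative squared distances; alpha controls how hard it is.
    """
    dist = (x[:, None] - codebook[None, :]) ** 2
    logits = -alpha * dist
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ codebook


def hard_quantize(x, codebook):
    """Nearest-codeword (hard) quantization."""
    return codebook[np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)]


codebook = np.array([-1.0, 0.0, 0.5, 1.0])
x = np.array([-0.9, 0.1, 0.45, 1.2])
soft = soft_quantize(x, codebook, alpha=1e4)
hard = hard_quantize(x, codebook)
```

    Because the soft path is differentiable in both x and the codebook, gradients can reach the encoder and the codewords during training, while a large alpha makes it behave like the hard quantizer.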

  40. arXiv:2102.03985  [pdf]

    cs.AI cs.HC cs.LG

    Multisource AI Scorecard Table for System Evaluation

    Authors: Erik Blasch, James Sung, Tao Nguyen

    Abstract: The paper describes a Multisource AI Scorecard Table (MAST) that provides the developer and user of an artificial intelligence (AI)/machine learning (ML) system with a standard checklist focused on the principles of good analysis adopted by the intelligence community (IC) to help promote the development of more understandable systems and engender trust in AI outputs. Such a scorecard enables a tra…

    Submitted 7 February, 2021; originally announced February 2021.

    Comments: Presented at AAAI FSS-20: Artificial Intelligence in Government and Public Sector, Washington, DC, USA

  41. arXiv:2101.00054  [pdf, other]

    cs.SD cs.LG eess.AS

    Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding

    Authors: Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, Minje Kim

    Abstract: Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we pres…

    Submitted 31 December, 2020; originally announced January 2021.

    Journal ref: IEEE Signal Processing Letters, vol. 27, pp. 2159-2163, 2020

  42. arXiv:2005.00919  [pdf, other]

    eess.SP cs.IT

    Compressed-Sensing based Beam Detection in 5G NR Initial Access

    Authors: Junmo Sung, Brian L. Evans

    Abstract: To support millimeter wave (mmWave) frequency bands in cellular communications, both the base station and the mobile platform utilize large antenna arrays to steer narrow beams towards each other to compensate for the path loss and improve communication performance. The time-frequency resource allocated for initial access, however, is limited, which gives rise to the need for efficient approaches for beam…

    Submitted 2 May, 2020; originally announced May 2020.

    Comments: 5 pages, 6 figures, SPAWC2020

  43. arXiv:2002.05604  [pdf, other]

    eess.AS cs.MM cs.SD eess.SP

    Efficient And Scalable Neural Residual Waveform Coding With Collaborative Quantization

    Authors: Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, Minje Kim

    Abstract: Scalability and efficiency are desired in neural speech codecs that support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC to a neural network, but bridges the computational capacity of advanced neural network model…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: Accepted in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 4-8, 2020
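
    CQ learns codebooks for both the LPC coefficients and the LPC residual. As background on those two quantities only, a minimal sketch of classic autocorrelation-method LPC analysis and the resulting prediction residual e[n] = x[n] - sum_k a_k x[n-k]. The function names, the order-2 model, and the synthetic AR(2) signal standing in for speech are illustrative assumptions; nothing here reproduces the paper's learned quantization.

```python
import numpy as np


def lpc_coefficients(x, order):
    """Autocorrelation-method LPC: solve the Toeplitz normal equations R a = r."""
    x = np.asarray(x, dtype=float)
    # autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[: len(x) - lag], x[lag:]) for lag in range(order + 1)])
    # Toeplitz matrix R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])


def lpc_residual(x, a):
    """Prediction error e[n] = x[n] - sum_k a[k-1] x[n-k], with zeros before n = 0."""
    x = np.asarray(x, dtype=float)
    prediction = np.zeros_like(x)
    for k, coef in enumerate(a, start=1):
        prediction[k:] += coef * x[:-k]
    return x - prediction


# Synthetic AR(2) signal standing in for a speech frame (illustrative only).
rng = np.random.default_rng(0)
w = rng.standard_normal(20000)
x = np.zeros_like(w)
for n in range(2, len(w)):
    x[n] = 1.3 * x[n - 1] - 0.4 * x[n - 2] + w[n]

a = lpc_coefficients(x, order=2)
e = lpc_residual(x, a)
```

    The estimated coefficients land near the generating values (1.3, -0.4), and the residual carries far less energy than the signal, which is why coding the coefficients plus the residual can be cheaper than coding the waveform directly.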

  44. arXiv:1911.05727  [pdf]

    cs.CY cs.IR eess.IV

    Artificial Intelligence Strategies for National Security and Safety Standards

    Authors: Erik Blasch, James Sung, Tao Nguyen, Chandra P. Daniel, Alisa P. Mason

    Abstract: Recent advances in artificial intelligence (AI) have led to an explosion of multimedia applications (e.g., computer vision (CV) and natural language processing (NLP)) for different domains such as commercial, industrial, and intelligence. In particular, the use of AI applications in a national security environment is often problematic because the opaque nature of the systems leads to an inability…

    Submitted 3 November, 2019; originally announced November 2019.

    Comments: Presented at AAAI FSS-19: Artificial Intelligence in Government and Public Sector, Arlington, Virginia, USA

  45. arXiv:1909.09861  [pdf, other]

    cs.IT

    Hybrid Beamformer Codebook Design and Ordering for Compressive mmWave Channel Estimation

    Authors: Junmo Sung, Brian L. Evans

    Abstract: In millimeter wave (mmWave) communication systems, beamforming with large antenna arrays is critical to overcome high path losses. Separating all-digital beamforming into analog and digital stages can provide the large reduction in power consumption and small loss in spectral efficiency needed for practical implementations. Developing algorithms with this favorable tradeoff is challenging due to t…

    Submitted 21 September, 2019; originally announced September 2019.

  46. arXiv:1909.09858  [pdf, other]

    cs.IT

    Versatile Compressive mmWave Hybrid Beamformer Codebook Design Framework

    Authors: Junmo Sung, Brian L. Evans

    Abstract: Hybrid beamforming (HB) architectures are attractive for wireless communication systems with large antenna arrays because the analog beamforming stage can significantly reduce the number of RF transceivers and hence power consumption. In HB systems, channel estimation (CE) becomes challenging due to indirect access by the baseband processing to the communication channels and due to low SNR before…

    Submitted 21 September, 2019; originally announced September 2019.

  47. arXiv:1907.05415  [pdf, other]

    quant-ph cs.LG

    Learning to learn with quantum neural networks via classical neural networks

    Authors: Guillaume Verdon, Michael Broughton, Jarrod R. McClean, Kevin J. Sung, Ryan Babbush, Zhang Jiang, Hartmut Neven, Masoud Mohseni

    Abstract: Quantum Neural Networks (QNNs) are a promising variational learning paradigm with applications to near-term quantum processors; however, they still face some significant challenges. One such challenge is finding good parameter initialization heuristics that ensure rapid and consistent convergence to local minima of the parameterized quantum circuit landscape. In this work, we train classical neural…

    Submitted 11 July, 2019; originally announced July 2019.

    Comments: 12 pages, 4 figures

  48. arXiv:1907.00482  [pdf, other]

    eess.SP cs.IT

    Base Station Antenna Selection for Low-Resolution ADC Systems

    Authors: Jinseok Choi, Junmo Sung, Narayan Prasad, Xiao-Feng Qi, Brian L. Evans, Alan Gatherer

    Abstract: This paper investigates antenna selection at a base station with large antenna arrays and low-resolution analog-to-digital converters. For downlink transmit antenna selection for narrowband channels, we show (1) a selection criterion that maximizes sum rate with zero-forcing precoding equivalent to that of a perfect quantization system; (2) maximum sum rate increases with number of selected antenn…

    Submitted 30 June, 2019; originally announced July 2019.

    Comments: Submitted to IEEE Transactions on Communications
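
    For context on the zero-forcing (ZF) precoding this abstract refers to, a minimal NumPy sketch of a ZF precoder W = H^H (H H^H)^{-1} for a K-user, M-antenna channel H, plus a brute-force transmit antenna selection that maximizes ZF sum rate. It assumes ideal (infinite-resolution) converters and one common power-scaling factor for all users, so it illustrates only the perfect-quantization baseline the abstract mentions, not the paper's low-resolution analysis; the helper names and parameters are hypothetical.

```python
import itertools

import numpy as np


def zf_precoder(H):
    """Zero-forcing precoder W = H^H (H H^H)^{-1} for a K x M channel, so that H @ W = I."""
    return H.conj().T @ np.linalg.inv(H @ H.conj().T)


def zf_sum_rate(H, power, noise_var):
    """ZF sum rate with ideal converters and one common power-normalization factor."""
    W = zf_precoder(H)
    beta_sq = power / np.real(np.trace(W @ W.conj().T))  # total transmit power = power
    num_users = H.shape[0]
    return num_users * np.log2(1.0 + beta_sq / noise_var)


def select_antennas(H, num_selected, power=1.0, noise_var=1.0):
    """Exhaustive search over subsets of transmit antennas (columns of H)."""
    best_subset, best_rate = None, -np.inf
    for subset in itertools.combinations(range(H.shape[1]), num_selected):
        rate = zf_sum_rate(H[:, list(subset)], power, noise_var)
        if rate > best_rate:
            best_subset, best_rate = subset, rate
    return best_subset, best_rate


rng = np.random.default_rng(1)
H = (rng.standard_normal((2, 6)) + 1j * rng.standard_normal((2, 6))) / np.sqrt(2)
subset, rate = select_antennas(H, num_selected=3)
```

    Under these assumptions, maximizing the sum rate is the same as minimizing trace((H_S H_S^H)^{-1}) over subsets S, and adding antennas can only shrink that trace, which is consistent with claim (2) for this idealized model. Exhaustive search is used only for clarity; it scales combinatorially with M.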

  49. arXiv:1906.07769  [pdf, other]

    eess.AS cs.LG cs.SD

    Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding

    Authors: Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim

    Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. C…

    Submitted 13 September, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

    Comments: Accepted for publication in INTERSPEECH 2019

    Journal ref: Published in Interspeech 2019
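
    The cascade described in this abstract, where each module reconstructs the residual left by its preceding modules, can be sketched with a toy pipeline. In CMRL the modules are neural codecs; here, purely as a hypothetical stand-in, each module is a uniform scalar quantizer with a progressively finer step, and the final reconstruction is the sum of all module outputs.

```python
import numpy as np


def uniform_quantizer(step):
    """Hypothetical stand-in for one coding module: rounds its input to a grid of width step."""
    return lambda signal: step * np.round(signal / step)


def cascaded_residual_coding(x, modules):
    """Each module codes the residual left by all preceding modules; outputs are summed."""
    reconstruction = np.zeros_like(x)
    mse_per_stage = []
    for module in modules:
        residual = x - reconstruction
        reconstruction = reconstruction + module(residual)
        mse_per_stage.append(float(np.mean((x - reconstruction) ** 2)))
    return reconstruction, mse_per_stage


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
modules = [uniform_quantizer(s) for s in (0.5, 0.1, 0.02)]
reconstruction, mse_per_stage = cascaded_residual_coding(x, modules)
```

    Each added module lowers the reconstruction error, and the final error is bounded by half the last module's step, so quality can be traded against the number of modules (and hence bits) that are run.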

  50. arXiv:1805.02838  [pdf, other]

    cs.CV

    A Memory Network Approach for Story-based Temporal Summarization of 360° Videos

    Authors: Sangho Lee, Jinyoung Sung, Youngjae Yu, Gunhee Kim

    Abstract: We address the problem of story-based temporal summarization of long 360° videos. We propose a novel memory network model named Past-Future Memory Network (PFMN), in which we first compute the scores of 81 normal field of view (NFOV) region proposals cropped from the input 360° video, and then recover a latent, collective summary using the network with two external memories that store the embeddin…

    Submitted 18 June, 2018; v1 submitted 8 May, 2018; originally announced May 2018.

    Comments: Accepted paper at CVPR 2018