-
Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts
Authors:
Taewon Kang,
Ming C. Lin
Abstract:
Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.
Submitted 22 May, 2025;
originally announced May 2025.
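The Recursive Narrative Bank in the Action2Dialogue abstract conditions each scene's dialogue on the utterances accumulated from earlier scenes. The sketch below is one possible reading of that loop, not the authors' implementation: the `NarrativeBank` class, the prompt layout, the `echo_llm` stand-in, and the use of a text caption in place of the encoder's visual feature are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class NarrativeBank:
    """Accumulates utterances from earlier scenes (hypothetical structure)."""
    history: List[str] = field(default_factory=list)

    def build_prompt(self, setting: str, action: str, visual_feature: str) -> str:
        # Condition the next utterance on everything said in prior scenes.
        past = "\n".join(self.history) if self.history else "(no prior dialogue)"
        return (
            f"Setting: {setting}\n"
            f"Action: {action}\n"
            f"Visual context: {visual_feature}\n"
            f"Dialogue so far:\n{past}\n"
            "Next line:"
        )


def generate_story_dialogue(
    scenes: List[Tuple[str, str, str]], llm: Callable[[str], str]
) -> List[str]:
    """Generate one utterance per scene, feeding each back into the bank."""
    bank = NarrativeBank()
    for setting, action, visual_feature in scenes:
        prompt = bank.build_prompt(setting, action, visual_feature)
        utterance = llm(prompt)
        bank.history.append(utterance)  # output of scene k conditions scene k+1
    return bank.history


def echo_llm(prompt: str) -> str:
    """Stand-in for a real language model: reports how many prior lines it saw."""
    marker = "Dialogue so far:\n"
    past = prompt.split(marker, 1)[1].rsplit("\nNext line:", 1)[0]
    seen = 0 if past == "(no prior dialogue)" else len(past.split("\n"))
    return f"line after {seen} prior utterance(s)"
```

Swapping `echo_llm` for a call to an actual language model keeps the same control flow; only the bank's growing history ties the scenes together.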
-
Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
Authors:
Jiwon Moon,
Yerin Hwang,
Dongryeol Lee,
Taegwan Kang,
Yongil Kim,
Kyomin Jung
Abstract:
With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations -- such as differences in variable names, comments, or formatting -- that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate that all tested LLM judges are susceptible to both positive and negative biases, resulting in inflated or unfairly low scores. Moreover, we observe that LLM judges remain vulnerable to these biases even when prompted to generate test cases before scoring, highlighting the need for more robust code evaluation methods.
Submitted 22 May, 2025;
originally announced May 2025.
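To make "semantically equivalent code with superficial variations" concrete, the two functions below compute the same result and differ only in variable names, a comment, and formatting, the kinds of surface features the abstract lists. Both functions and the wording of the comment are invented for illustration and are not taken from the paper's benchmark.

```python
def mean_plain(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)


def mean_restyled(xs):
    # Verified correct by the senior engineering team.  (carries no semantics)
    acc=0
    for item in xs: acc+=item
    return acc/len(xs)
```

A judge that is robust in the sense the abstract describes would assign both versions the same correctness score, since their behavior is identical on every input.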
-
Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation
Authors:
Yerin Hwang,
Dongryeol Lee,
Kyungmin Min,
Taegwan Kang,
Yong-il Kim,
Kyomin Jung
Abstract:
Recently, large vision-language models (LVLMs) have emerged as the preferred tools for judging text-image alignment, yet their robustness along the visual modality remains underexplored. This work is the first study to address a key research question: Can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores? We define potential image-induced biases within the context of T2I evaluation and examine how these biases affect the evaluations of LVLM judges. Moreover, we introduce a novel, fine-grained, multi-domain meta-evaluation benchmark named FRAME, which is deliberately constructed to exhibit diverse score distributions. By introducing the defined biases into the benchmark, we reveal that all tested LVLM judges exhibit vulnerability across all domains, consistently inflating scores for manipulated images. Further analysis reveals that combining multiple biases amplifies their effects, and pairwise evaluations are similarly susceptible. Moreover, we observe that visual biases persist under prompt-based mitigation strategies, highlighting the vulnerability of current LVLM evaluation systems and underscoring the urgent need for more robust LVLM judges.
Submitted 21 May, 2025;
originally announced May 2025.
-
SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation
Authors:
Minho Park,
Taewoong Kang,
Jooyeol Yun,
Sungwon Hwang,
Jaegul Choo
Abstract:
The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning-free methods that still rely on ERP latent representations, leading to discontinuities near the poles. In this paper, we introduce SphereDiff, a novel approach for seamless 360-degree panoramic image and video generation using state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures uniform distribution across all perspectives, mitigating the distortions inherent in ERP. We extend MultiDiffusion to spherical latent space and propose a spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality in the projection process. Our method outperforms existing approaches in generating 360-degree panoramic content while maintaining high fidelity, making it a robust solution for immersive AR/VR applications. The code is available at https://github.com/pmh9960/SphereDiff
Submitted 19 April, 2025;
originally announced April 2025.
-
A real-time anomaly detection method for robots based on a flexible and sparse latent space
Authors:
Taewook Kang,
Bum-Jae You,
Juyoun Park,
Yisoo Lee
Abstract:
The growing demand for robots to operate effectively in diverse environments necessitates robust real-time anomaly detection techniques during robotic operations. However, deep learning-based models in robotics face significant challenges due to limited training data and highly noisy signal features. In this paper, we present the Sparse Masked Autoregressive Flow-based Adversarial AutoEncoder model to address these problems. This approach integrates a Masked Autoregressive Flow model into Adversarial AutoEncoders to construct a flexible latent space and utilizes a sparse autoencoder to efficiently focus on important features, even in scenarios with limited feature space. Our experiments demonstrate that the proposed model achieves a 4.96% to 9.75% higher area under the receiver operating characteristic curve for pick-and-place robotic operations with randomly placed cans, compared to existing state-of-the-art methods. Notably, it showed up to 19.67% better performance in scenarios involving collisions with lightweight objects. Additionally, unlike the existing state-of-the-art model, our model performs inferences within 1 millisecond, ensuring real-time anomaly detection. These capabilities make our model highly applicable to machine learning-based robotic safety systems in dynamic environments. The code is available at https://github.com/twkang43/sparse-maf-aae.
Submitted 22 June, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
A production planning benchmark for real-world refinery-petrochemical complexes
Authors:
Wenli Du,
Chuan Wang,
Chen Fan,
Zhi Li,
Yeke Zhong,
Tianao Kang,
Ziting Liang,
Minglei Yang,
Feng Qian,
Xin Dai
Abstract:
To achieve digital intelligence transformation and carbon neutrality, effective production planning is crucial for integrated refinery-petrochemical complexes. Modern refinery planning relies on advanced optimization techniques, whose development requires reproducible benchmark problems. However, existing benchmarks lack practical context or impose oversimplified assumptions, limiting their applicability to enterprise-wide optimization. To bridge the substantial gap between theoretical research and industrial applications, this paper introduces the first open-source, demand-driven benchmark for industrial-scale refinery-petrochemical complexes with transparent model formulations and comprehensive input parameters. The benchmark incorporates a novel port-stream hybrid superstructure for modular modeling and broad generalizability. Key secondary processing units are represented using the delta-base approach grounded in historical data. Three real-world cases have been constructed to encompass distinct scenario characteristics, respectively addressing (1) a stand-alone refinery without integer variables, (2) chemical site integration with inventory-related integer variables, and (3) multi-period planning. All model parameters are fully accessible. Additionally, this paper provides an analysis of computational performance, ablation experiments on delta-base modeling, and application scenarios for the proposed benchmark.
Submitted 27 March, 2025;
originally announced March 2025.
-
Extendable Long-Horizon Planning via Hierarchical Multiscale Diffusion
Authors:
Chang Chen,
Hany Hamed,
Doojin Baek,
Taegu Kang,
Yoshua Bengio,
Sungjin Ahn
Abstract:
This paper tackles a novel problem, extendable long-horizon planning -- enabling agents to plan trajectories longer than those in training data without compounding errors. To tackle this, we propose the Hierarchical Multiscale Diffuser (HM-Diffuser) and Progressive Trajectory Extension (PTE), an augmentation method that iteratively generates longer trajectories by stitching shorter ones. HM-Diffuser trains on these extended trajectories using a hierarchical structure, efficiently handling tasks across multiple temporal scales. Additionally, we introduce Adaptive Plan Pondering and the Recursive HM-Diffuser, which consolidate hierarchical layers into a single model to process temporal scales recursively. Experimental results demonstrate the effectiveness of our approach, advancing diffusion-based planners for scalable long-horizon planning.
Submitted 10 April, 2025; v1 submitted 25 March, 2025;
originally announced March 2025.
-
Text2Story: Advancing Video Storytelling with Text Guidance
Authors:
Taewon Kang,
Divya Kothandaraman,
Ming C. Lin
Abstract:
Generating coherent long-form video sequences from discrete input using only text prompts is a critical task in content creation. While diffusion-based models excel at short video synthesis, long-form storytelling from text remains largely unexplored and challenging because of difficulties with temporal coherency, preservation of semantic meaning, and action continuity across the video. We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives. We present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video being generated. Further, our method extends the Black-Scholes algorithm from prompt mixing for image generation to video generation, enabling controlled motion evolution through structured text conditioning. To further enhance motion continuity, we propose a semantic action representation framework to encode high-level action semantics into the blending process, dynamically adjusting transitions based on action similarity, ensuring smooth yet adaptable motion changes. Latent space blending maintains spatial coherence between objects in a scene, while time-weighted blending enforces bidirectional constraints for temporal consistency. This integrative approach prevents abrupt transitions while ensuring fluid storytelling. Extensive experiments demonstrate significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without any additional training. Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.
Submitted 8 March, 2025;
originally announced March 2025.
-
Zero-Shot Head Swapping in Real-World Scenarios
Authors:
Taewoong Kang,
Sohyun Jeong,
Hyojin Jang,
Jaegul Choo
Abstract:
With growing demand in media and social networks for personalized images, the need for advanced head-swapping techniques, integrating an entire head from the head image with the body from the body image, has increased. However, traditional head swapping methods heavily rely on face-centered cropped data with primarily frontal-facing views, which limits their effectiveness in real-world applications. Additionally, their masking methods, designed to indicate regions requiring editing, are optimized for these types of datasets but struggle to achieve seamless blending in complex situations, such as when the original data includes features like long hair extending beyond the masked area. To overcome these limitations and enhance adaptability in diverse and complex scenarios, we propose a novel head swapping method, HID, that is robust to images including the full head and the upper body, and handles views ranging from frontal to side, while automatically generating context-aware masks. For automatic mask generation, we introduce the IOMask, which enables seamless blending of the head and body, effectively addressing integration challenges. We further introduce the hair injection module to capture hair details with greater precision. Our experiments demonstrate that the proposed approach achieves state-of-the-art performance in head swapping, providing visually consistent and realistic results across a wide range of challenging conditions.
Submitted 24 March, 2025; v1 submitted 2 March, 2025;
originally announced March 2025.
-
Bridging Information Gaps with Comprehensive Answers: Improving the Diversity and Informativeness of Follow-Up Questions
Authors:
Zhe Liu,
Taekyu Kang,
Haoyu Wang,
Seyed Hossein Alavi,
Vered Shwartz
Abstract:
Effective conversational systems are expected to dynamically generate contextual follow-up questions to elicit new information while maintaining the conversation flow. While humans excel at asking diverse and informative questions by intuitively assessing both obtained and missing information, existing models often fall short of human performance on this task. To mitigate this, we propose a method that generates diverse and informative questions by targeting unanswered information using a hypothetical LLM-generated "comprehensive answer". Our method is applied to augment an existing follow-up questions dataset. The experimental results demonstrate that language models fine-tuned on the augmented datasets produce follow-up questions of significantly higher quality and diversity. This promising approach could be effectively adopted in future work to augment information-seeking dialogues for reducing ambiguities and improving the accuracy of LLM answers.
Submitted 24 February, 2025;
originally announced February 2025.
-
LLMs can be easily Confused by Instructional Distractions
Authors:
Yerin Hwang,
Yongil Kim,
Jahyun Koo,
Taegwan Kang,
Hyunkyung Bae,
Kyomin Jung
Abstract:
Although large language models (LLMs) show exceptional skill in instruction-following tasks, this strength can turn into a vulnerability when the models are required to disregard certain instructions. Instruction-following tasks typically involve a clear task description and input text containing the target data to be processed. However, when the input itself resembles an instruction, confusion may arise, even if there is explicit prompting to distinguish between the task instruction and the input. We refer to this phenomenon as instructional distraction. In this paper, we introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction. The benchmark categorizes real-world instances of instructional distraction and evaluates LLMs across four instruction tasks (rewriting, proofreading, translation, and style transfer) alongside five input tasks (reasoning, code generation, mathematical reasoning, bias detection, and question answering). Our experimental results reveal that even the most advanced LLMs are susceptible to instructional distraction, often failing to accurately follow user intent in such cases.
Submitted 4 February, 2025;
originally announced February 2025.
-
Embracing Dialectic Intersubjectivity: Coordination of Different Perspectives in Content Analysis with LLM Persona Simulation
Authors:
Taewoo Kang,
Kjerstin Thorson,
Tai-Quan Peng,
Dan Hiaeshutter-Rice,
Sanguk Lee,
Stuart Soroka
Abstract:
This study attempts to advance content analysis methodology from consensus-oriented to coordination-oriented practices, thereby embracing diverse coding outputs and exploring the dynamics among differential perspectives. As an exploratory investigation of this approach, we evaluate six GPT-4o configurations to analyze sentiment in Fox News and MSNBC transcripts on Biden and Trump during the 2020 U.S. presidential campaign, examining patterns across these models. By assessing each model's alignment with ideological perspectives, we explore how partisan selective processing could be identified in LLM-Assisted Content Analysis (LACA). Findings reveal that partisan persona LLMs exhibit stronger ideological biases when processing politically congruent content. Additionally, intercoder reliability is higher among same-partisan personas compared to cross-partisan pairs. This approach enhances the nuanced understanding of LLM outputs and advances the integrity of AI-driven social science research, enabling simulations of real-world implications.
Submitted 4 February, 2025; v1 submitted 2 February, 2025;
originally announced February 2025.
-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes
, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs
Authors:
Prabhu Vellaisamy,
Harideep Nair,
Thomas Kang,
Yichen Ni,
Haoyang Fan,
Bin Qi,
Jeff Chen,
Shawn Blanton,
John Paul Shen
Abstract:
The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLAs) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with a highly scalable unary-based PE array comprising tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core's PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA's CMAC unit. Considering a 16x16 PE array in Tempus Core, area and power improve by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place and route analysis of Tempus Core's PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017 mm^2 die area and consumes only 6.2mW of total power. We demonstrate that area-power efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the path for efficient unary hardware for edge AI inference.
Submitted 25 December, 2024;
originally announced December 2024.
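The Tempus Core abstract does not describe the tub multiplier circuit, so the following is only a behavioral sketch under a stated assumption: one operand is treated as a temporal-unary value (a count of clock cycles) and the other as an ordinary binary value that is accumulated once per cycle. The function names are hypothetical. Under this reading, a zero operand costs zero cycles and small low-precision magnitudes cost few cycles, which is one way to see why sparsity and low precision can help.

```python
from typing import Iterable


def tub_multiply(unary_operand: int, binary_operand: int) -> int:
    """Multiply by accumulating the binary operand once per unary 'cycle'."""
    sign = -1 if unary_operand < 0 else 1
    acc = 0
    for _ in range(abs(unary_operand)):  # each iteration models one clock cycle
        acc += binary_operand
    return sign * acc


def tub_dot(weights: Iterable[int], activations: Iterable[int]) -> int:
    """Dot product built from the cycle-by-cycle multiplier above."""
    return sum(tub_multiply(w, a) for w, a in zip(weights, activations))
```

This models only the arithmetic, not the PE array dataflow, timing, or area and power figures reported in the abstract.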
-
Federated Low-Rank Adaptation with Differential Privacy over Wireless Networks
Authors:
Tianqu Kang,
Zixin Wang,
Hengtao He,
Jun Zhang,
Shenghui Song,
Khaled B. Letaief
Abstract:
Fine-tuning large pre-trained foundation models (FMs) on distributed edge devices presents considerable computational and privacy challenges. Federated fine-tuning (FedFT) mitigates some privacy issues by facilitating collaborative model training without the need to share raw data. To lessen the computational burden on resource-limited devices, combining low-rank adaptation (LoRA) with federated learning enables parameter-efficient fine-tuning. Additionally, the split FedFT architecture partitions an FM between edge devices and a central server, reducing the necessity for complete model deployment on individual devices. However, the risk of privacy eavesdropping attacks in FedFT remains a concern, particularly in sensitive areas such as healthcare and finance. In this paper, we propose a split FedFT framework with differential privacy (DP) over wireless networks, where the inherent wireless channel noise in the uplink transmission is utilized to achieve DP guarantees without adding extra artificial noise. We shall investigate the impact of the wireless noise on the convergence performance of the proposed framework. We will also show that by updating only one of the low-rank matrices in the split FedFT with DP, the proposed method can mitigate the noise amplification effect. Simulation results will demonstrate that the proposed framework achieves higher accuracy under strict privacy budgets compared to baseline methods.
Submitted 27 November, 2024; v1 submitted 12 November, 2024;
originally announced November 2024.
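A short piece of algebra may help show why updating only one low-rank matrix can limit noise amplification, as the abstract above claims. The notation is assumed for illustration: the LoRA update is written as the product $BA$, and $N_B$, $N_A$ are hypothetical additive noise terms on each factor. Which factor the authors actually freeze is not stated in the abstract; $A$ is held fixed here purely as an example.

```latex
\[
\begin{aligned}
\text{both factors noisy:}\quad (B + N_B)(A + N_A) &= BA + B N_A + N_B A + N_B N_A, \\
\text{only } B \text{ updated:}\quad (B + N_B)\,A &= BA + N_B A .
\end{aligned}
\]
```

With both factors perturbed, the error contains the cross term $B N_A$ and the quadratic term $N_B N_A$; with one factor fixed, only the linear term $N_B A$ remains.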
-
SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models
Authors:
Jahyun Koo,
Yerin Hwang,
Yongil Kim,
Taegwan Kang,
Hyunkyung Bae,
Kyomin Jung
Abstract:
Despite the success of Large Language Models (LLMs), they still face challenges related to high inference costs and memory requirements. To address these issues, Knowledge Distillation (KD) has emerged as a popular method for model compression, with student-generated outputs (SGOs) as training data being particularly notable for reducing the mismatch between training and inference. However, SGOs often produce noisy and biased sequences, which can lead to misguidance from the teacher model, especially in long sequences. To mitigate these challenges, we propose SWITCH (Studying WIth TeaCHer for Knowledge Distillation), a novel approach that strategically incorporates the teacher model during the student's sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in long sequences that are more prone to teacher misguidance. Extensive experimental results across three model families and five instruction-following datasets show that SWITCH surpasses traditional KD methods, particularly excelling in the generation of long sequential data.
Submitted 22 April, 2025; v1 submitted 25 October, 2024;
originally announced October 2024.
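The SWITCH abstract says the teacher intervenes selectively when its token probabilities diverge from the student's during student-side generation. The sketch below illustrates that control flow only. The discrepancy measure (total variation distance), the fixed `threshold`, greedy token selection, and the dictionary-based toy models are all assumptions; the abstract does not specify any of them.

```python
from typing import Callable, Dict, List

Dist = Dict[str, float]  # token -> probability


def total_variation(p: Dist, q: Dist) -> float:
    """Half the L1 distance between two token distributions."""
    tokens = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens)


def generate_with_teacher(
    student: Callable[[List[str]], Dist],
    teacher: Callable[[List[str]], Dist],
    prompt: List[str],
    max_new_tokens: int,
    threshold: float,
) -> List[str]:
    """Student generates; the teacher takes over a step on strong disagreement."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        p_student, p_teacher = student(seq), teacher(seq)
        disagree = total_variation(p_student, p_teacher) > threshold
        chosen = p_teacher if disagree else p_student
        seq.append(max(chosen, key=chosen.get))  # greedy pick keeps the demo deterministic
    return seq[len(prompt):]
```

In a real distillation setup the two callables would wrap model forward passes and the next token would typically be sampled rather than picked greedily.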
-
SurFhead: Affine Rig Blending for Geometrically Accurate 2D Gaussian Surfel Head Avatars
Authors:
Jaeseong Lee,
Taewoong Kang,
Marcel C. Bühler,
Min-Jung Kim,
Sungwon Hwang,
Junha Hyung,
Hyojin Jang,
Jaegul Choo
Abstract:
Recent advancements in head avatar rendering using Gaussian primitives have achieved significantly high-fidelity results. Although precise head geometry is crucial for applications like mesh reconstruction and relighting, current methods struggle to capture intricate geometric details and render unseen poses due to their reliance on similarity transformations, which cannot handle stretch and shear transforms essential for detailed deformations of geometry. To address this, we propose SurFhead, a novel method that reconstructs riggable head geometry from RGB videos using 2D Gaussian surfels, which offer well-defined geometric properties, such as precise depth from fixed ray intersections and normals derived from their surface orientation, making them advantageous over 3D counterparts. SurFhead ensures high-fidelity rendering of both normals and images, even in extreme poses, by leveraging classical mesh-based deformation transfer and affine transformation interpolation. SurFhead introduces precise geometric deformation and blends surfels through polar decomposition of transformations, including those affecting normals. Our key contribution lies in bridging classical graphics techniques, such as mesh-based deformation, with modern Gaussian primitives, achieving state-of-the-art geometry reconstruction and rendering quality. Unlike previous avatar rendering approaches, SurFhead enables efficient reconstruction driven by Gaussian primitives while preserving high-fidelity geometry.
Submitted 18 April, 2025; v1 submitted 15 October, 2024;
originally announced October 2024.
-
3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance
Authors:
Taewon Kang,
Divya Kothandaraman,
Dinesh Manocha,
Ming C. Lin
Abstract:
Recent 3D novel view synthesis (NVS) methods often require extensive 3D data for training, and also typically lack generalization beyond the training distribution. Moreover, they tend to be object centric and struggle with complex and intricate scenes. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without the need for a large amount of 3D-based training data, but lack camera control. In this paper, we introduce a method capable of generating camera-controlled viewpoints from a single input image, by combining the benefits of 3D-free and 3D-based approaches. Our method excels in handling complex and diverse scenes without extensive training or additional 3D and multiview data. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis style approach, along with enriching the CLIP vision-language space with 3D camera angle information, to achieve the desired results. Experimental results demonstrate that our method outperforms existing models in both qualitative and quantitative evaluations, achieving high-fidelity, consistent novel view synthesis at desired camera angles across a wide variety of scenes while maintaining accurate, natural detail representation and image clarity across various viewpoints. We also support our method with a comprehensive analysis of 2D image generation models and the 3D space, providing a solid foundation and rationale for our solution.
Submitted 27 November, 2024; v1 submitted 12 August, 2024;
originally announced August 2024.
-
VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors
Authors:
Sungwon Hwang,
Min-Jung Kim,
Taewoong Kang,
Jayeon Kang,
Jaegul Choo
Abstract:
Neural rendering-based urban scene reconstruction methods commonly rely on images collected from driving vehicles with cameras facing and moving forward. Although these methods can successfully synthesize from views similar to the training camera trajectory, directing the novel view outside the training camera distribution does not guarantee on-par performance. In this paper, we tackle the Extrapolated View Synthesis (EVS) problem by evaluating the reconstructions on views such as looking left, right, or downwards with respect to the training camera distributions. To improve rendering quality for EVS, we initialize our model by constructing a dense LiDAR map and propose to leverage prior scene knowledge such as a surface normal estimator and a large-scale diffusion model. Qualitative and quantitative comparisons demonstrate the effectiveness of our methods on EVS. To the best of our knowledge, we are the first to address the EVS problem in urban scene reconstruction. Link to our project page: https://vegs3d.github.io/.
Submitted 13 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
The Effect of Quantization in Federated Learning: A Rényi Differential Privacy Perspective
Authors:
Tianqu Kang,
Lumin Liu,
Hengtao He,
Jun Zhang,
S. H. Song,
Khaled B. Letaief
Abstract:
Federated Learning (FL) is an emerging paradigm that holds great promise for privacy-preserving machine learning using distributed data. To enhance privacy, FL can be combined with Differential Privacy (DP), which involves adding Gaussian noise to the model weights. However, FL faces a significant challenge in terms of large communication overhead when transmitting these model weights. To address this issue, quantization is commonly employed. Nevertheless, the presence of quantized Gaussian noise introduces complexities in understanding privacy protection. This research paper investigates the impact of quantization on privacy in FL systems. We examine the privacy guarantees of quantized Gaussian mechanisms using Rényi Differential Privacy (RDP). By deriving the privacy budget of quantized Gaussian mechanisms, we demonstrate that lower quantization bit levels provide improved privacy protection. To validate our theoretical findings, we employ Membership Inference Attacks (MIA), which gauge the accuracy of privacy leakage. The numerical results align with our theoretical analysis, confirming that quantization can indeed enhance privacy protection. This study not only enhances our understanding of the correlation between privacy and communication in FL but also underscores the advantages of quantization in preserving privacy.
Submitted 16 May, 2024;
originally announced May 2024.
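As a rough illustration of the quantized Gaussian mechanism discussed in the abstract above, the sketch below clips each weight, adds Gaussian noise, and rounds the result onto a uniform grid with a given number of bits. The function name `quantized_gaussian`, the clipping range, and the deterministic uniform quantizer are all assumptions made for this example; the paper's actual mechanism and its RDP analysis may differ.

```python
import random


def quantized_gaussian(weights, clip, sigma, bits, rng):
    """Clip, add Gaussian noise, then round onto a uniform grid.

    Hypothetical helper for illustration only; not the paper's mechanism.
    The grid has 2**bits points spread evenly over [-clip, clip].
    """
    if bits < 1:
        raise ValueError("bits must be at least 1")
    levels = 2 ** bits - 1          # number of intervals between grid points
    step = 2 * clip / levels        # spacing between adjacent grid points
    out = []
    for w in weights:
        w = max(-clip, min(clip, w))              # bound each weight
        noisy = w + rng.gauss(0.0, sigma)         # Gaussian perturbation
        noisy = max(-clip, min(clip, noisy))      # keep the value on the grid range
        idx = round((noisy + clip) / step)        # nearest grid index
        out.append(-clip + idx * step)
    return out


# Usage: a 2-bit grid over [-1, 1] has four points spaced 2/3 apart.
rng = random.Random(0)
print(quantized_gaussian([0.3, -2.0, 0.9], clip=1.0, sigma=0.1, bits=2, rng=rng))
```

Fewer bits means a coarser grid, so more distinct noisy values collapse onto the same output; this is only an informal intuition for the abstract's claim that lower bit levels improve privacy protection.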
-
Capturing Momentum: Tennis Match Analysis Using Machine Learning and Time Series Theory
Authors:
Jingdi Lei,
Tianqi Kang,
Yuluan Cao,
Shiwei Ren
Abstract:
This paper presents an analysis of momentum in tennis matches. Because the approach generalizes well, it can help in building systems that predict the outcome of sports games and analyze player performance based on technical statistics. We first use hidden Markov models to predict momentum, which is defined as the performance of players. We then use XGBoost to demonstrate the significance of momentum. Finally, we use LightGBM to evaluate the performance of our model and use SHAP feature-importance ranking and weight analysis to identify the key factors that affect player performance.
Submitted 20 April, 2024;
originally announced April 2024.
-
Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting
Authors:
Taeho Kang,
Youngki Lee
Abstract:
We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address the challenge, prior methods employ joint heatmaps, which are probabilistic 2D representations of the body pose, but heatmap-to-3D pose conversion still remains an inaccurate process. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into an effective feature embedding using self-attention. Then, the Propagation Network estimates the 3D pose by utilizing skeletal information to better estimate the position of obscure joints. Our method significantly outperforms the previous state-of-the-art both qualitatively and quantitatively, as demonstrated by a 23.9% reduction of error in the MPJPE metric. Our source code is available on GitHub.
Submitted 28 February, 2024;
originally announced February 2024.
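The EgoTAP abstract above reports its result as a reduction in MPJPE (mean per-joint position error), which is the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch of that metric follows; the function names and the toy numbers are illustrative assumptions, not values from the paper.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error for one pose.

    pred and gt are equal-length sequences of 3D joint coordinates.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline_err, new_err):
    """Fractional error reduction relative to a baseline (0.239 means 23.9%)."""
    return (baseline_err - new_err) / baseline_err


# Usage with made-up joints: one joint is exact, the other is off by 5 units.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
print(mpjpe(pred, gt))  # 2.5
```

Benchmarks typically average this per-pose value over all frames of a test set; the alignment and units used depend on the dataset protocol.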
-
Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0
Authors:
Taein Kang,
Soyul Han,
Sunmook Choi,
Jaejin Seo,
Sanghyeok Chung,
Seungeun Lee,
Seungsang Oh,
Il-Youp Kwak
Abstract:
Conventional spoofing detection systems have heavily relied on the use of handcrafted features derived from speech data. However, a notable shift has recently emerged towards the direct utilization of raw speech waveforms, as demonstrated by methods like SincNet filters. This shift underscores the demand for more sophisticated audio sample features. Moreover, the success of deep learning models, particularly those utilizing large pretrained wav2vec 2.0 as a featurization front-end, highlights the importance of refined feature encoders. In response, this research assessed the representational capability of wav2vec 2.0 as an audio feature extractor, modifying the size of its pretrained Transformer layers through two key adjustments: (1) selecting a subset of layers starting from the leftmost one and (2) fine-tuning a portion of the selected layers from the rightmost one. We complemented this analysis with five spoofing detection back-end models, with a primary focus on AASIST, enabling us to pinpoint the optimal configuration for the selection and fine-tuning process. In contrast to conventional handcrafted features, our investigation identified several spoofing detection systems that achieve state-of-the-art performance in the ASVspoof 2019 LA dataset. This comprehensive exploration offers valuable insights into feature selection strategies, advancing the field of spoofing detection.
Submitted 26 February, 2024;
originally announced February 2024.
-
Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination
Authors:
Nakyeong Yang,
Taegwan Kang,
Jungkyu Choi,
Honglak Lee,
Kyomin Jung
Abstract:
Instruction-following language models often show undesirable biases. These undesirable biases may be accelerated in the real-world usage of language models, where a wide range of instructions is used through zero-shot example prompting. To solve this problem, we first define the bias neuron, which significantly affects biased outputs, and prove its existence empirically. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge. The experimental results reveal the generalizability of our method as it shows robustness under various instructions and datasets. Surprisingly, our method can mitigate the bias in language models by eliminating only a few neurons (at least three).
Submitted 5 June, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Expression Domain Translation Network for Cross-domain Head Reenactment
Authors:
Taewoong Kang,
Jeongsik Oh,
Jaeseong Lee,
Sunghyun Park,
Jaegul Choo
Abstract:
Despite the remarkable advancements in head reenactment, the existing methods face challenges in cross-domain head reenactment, which aims to transfer human motions to domains outside the human, including cartoon characters. It is still difficult to extract motion from out-of-domain images due to the distinct appearances, such as large eyes. Recently, previous work introduced a large-scale anime dataset called AnimeCeleb and a cross-domain head reenactment model, including an optimization-based mapping function to translate the human domain's expressions to the anime domain. However, we found that the mapping function, which relies on a subset of expressions, imposes limitations on the mapping of various expressions. To solve this challenge, we introduce a novel expression domain translation network that transforms human expressions into anime expressions. Specifically, to maintain the geometric consistency of expressions between the input and output of the expression domain translation network, we employ a 3D geometric-aware loss function that reduces the distances between the vertices in the 3D mesh of the human and anime. By doing so, it forces high-fidelity and one-to-one mapping with respect to two cross-expression domains. Our method outperforms existing methods in both qualitative and quantitative analysis, marking a significant advancement in the field of cross-domain head reenactment.
Submitted 6 November, 2023; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views
Authors:
Taeho Kang,
Kyungjin Lee,
Jinrui Zhang,
Youngki Lee
Abstract:
We present Ego3DPose, a highly accurate binocular egocentric 3D pose reconstruction system. The binocular egocentric setup offers practicality and usefulness in various applications; however, it remains largely under-explored. It has suffered from low pose estimation accuracy due to viewing distortion, severe self-occlusion, and limited field-of-view of the joints in egocentric 2D images. Here, we notice that two important 3D cues contained in the egocentric binocular input, stereo correspondences and perspective, are neglected. Current methods heavily rely on 2D image features, implicitly learning 3D information, which introduces biases towards commonly observed motions and leads to low overall accuracy. We observe that they fail not only in challenging occlusion cases but also in estimating visible joint positions. To address these challenges, we propose two novel approaches. First, we design a two-path network architecture with a path that estimates pose per limb independently with its binocular heatmaps. Without full-body information provided, it alleviates bias toward trained full-body distribution. Second, we leverage the egocentric view of body limbs, which exhibits strong perspective variance (e.g., a significantly large-size hand when it is close to the camera). We propose a new perspective-aware representation using trigonometry, enabling the network to estimate the 3D orientation of limbs. Finally, we develop an end-to-end pose reconstruction network that synergizes both techniques. Our comprehensive evaluations demonstrate that Ego3DPose outperforms state-of-the-art models by a pose estimation error (i.e., MPJPE) reduction of 23.1% in the UnrealEgo dataset. Our qualitative results highlight the superiority of our approach across a range of scenarios and challenges.
Submitted 21 September, 2023;
originally announced September 2023.
-
An Empirical Study on Fault Detection and Root Cause Analysis of Indium Tin Oxide Electrodes by Processing S-parameter Patterns
Authors:
Tae Yeob Kang,
Haebom Lee,
Sungho Suh
Abstract:
In the field of optoelectronics, indium tin oxide (ITO) electrodes play a crucial role in various applications, such as displays, sensors, and solar cells. Effective fault diagnosis and root cause analysis of the ITO electrodes are essential to ensure the performance and reliability of the devices. However, traditional visual inspection is challenging with transparent ITO electrodes, and existing fault diagnosis methods have limitations in determining the root causes of the defects, often requiring destructive evaluations and secondary material characterization techniques. In this study, a fault diagnosis method with root cause analysis is proposed using scattering parameter (S-parameter) patterns, offering early detection, high diagnostic accuracy, and noise robustness. A comprehensive S-parameter pattern database is obtained according to various defect states of the ITO electrodes. Deep learning (DL) approaches, including multilayer perceptron (MLP), convolutional neural network (CNN), and transformer, are then used to simultaneously analyze the cause and severity of defects. Notably, it is demonstrated that the diagnostic performance under additive noise levels can be significantly enhanced by combining different channels of the S-parameters as input to the learning algorithms, as confirmed through the t-distributed stochastic neighbor embedding (t-SNE) dimension reduction visualization of the S-parameter patterns.
Submitted 10 June, 2024; v1 submitted 16 August, 2023;
originally announced August 2023.
-
One for Multiple: Physics-informed Synthetic Data Boosts Generalizable Deep Learning for Fast MRI Reconstruction
Authors:
Zi Wang,
Xiaotong Yu,
Chengyan Wang,
Weibo Chen,
Jiazheng Wang,
Ying-Hua Chu,
Hongwei Sun,
Rushuai Li,
Peiyong Li,
Fan Yang,
Haiwei Han,
Taishan Kang,
Jianzhong Lin,
Chen Yang,
Shufu Chang,
Zhang Shi,
Sha Hua,
Yan Li,
Juan Hu,
Liuhong Zhu,
Jianjun Zhou,
Meijing Lin,
Jiefeng Guo,
Congbo Cai,
Zhong Chen
, et al. (3 additional authors not shown)
Abstract:
Magnetic resonance imaging (MRI) is a widely used radiological modality renowned for its radiation-free, comprehensive insights into the human body, facilitating medical diagnoses. However, the drawback of prolonged scan times hinders its accessibility. The k-space undersampling offers a solution, yet the resultant artifacts necessitate meticulous removal during image reconstruction. Although Deep Learning (DL) has proven effective for fast MRI image reconstruction, its broader applicability across various imaging scenarios has been constrained. Challenges include the high cost and privacy restrictions associated with acquiring large-scale, diverse training data, coupled with the inherent difficulty of addressing mismatches between training and target data in existing DL methodologies. Here, we present a novel Physics-Informed Synthetic data learning framework for Fast MRI, called PISF. PISF marks a breakthrough by enabling generalized DL for multi-scenario MRI reconstruction through a single trained model. Our approach separates the reconstruction of a 2D image into many 1D basic problems, commencing with 1D data synthesis to facilitate generalization. We demonstrate that training DL models on synthetic data, coupled with enhanced learning techniques, yields in vivo MRI reconstructions comparable to or surpassing those of models trained on matched realistic datasets, reducing the reliance on real-world MRI data by up to 96%. Additionally, PISF exhibits remarkable generalizability across multiple vendors and imaging centers. Its adaptability to diverse patient populations has been validated through evaluations by ten experienced medical professionals. PISF presents a feasible and cost-effective way to significantly boost the widespread adoption of DL in various fast MRI applications.
Submitted 28 February, 2024; v1 submitted 24 July, 2023;
originally announced July 2023.
-
Magnetic Resonance Spectroscopy Quantification Aided by Deep Estimations of Imperfection Factors and Macromolecular Signal
Authors:
Dicheng Chen,
Meijin Lin,
Huiting Liu,
Jiayu Li,
Yirong Zhou,
Taishan Kang,
Liangjie Lin,
Zhigang Wu,
Jiazheng Wang,
Jing Li,
Jianzhong Lin,
Xi Chen,
Di Guo,
Xiaobo Qu
Abstract:
Objective: Magnetic Resonance Spectroscopy (MRS) is an important technique for biomedical detection. However, it is challenging to accurately quantify metabolites with proton MRS due to serious overlaps of metabolite signals, imperfections because of non-ideal acquisition conditions, and interference with strong background signals mainly from macromolecules. The most popular method, LCModel, adopts complicated non-linear least squares to quantify metabolites and addresses these problems by designing empirical priors such as basis sets and imperfection factors. However, when the signal-to-noise ratio of the MRS signal is low, the solution may have a large deviation. Methods: Linear Least Squares (LLS) is integrated with deep learning to reduce the complexity of solving this overall quantification. First, a neural network is designed to explicitly predict the imperfection factors and the overall signal from macromolecules. Then, metabolite quantification is solved analytically with the introduced LLS. In our Quantification Network (QNet), LLS takes part in the backpropagation of network training, which allows the feedback of the quantification error into metabolite spectrum estimation. This scheme greatly improves the generalization to metabolite concentrations unseen in training compared to the end-to-end deep learning method. Results: Experiments show that, compared with LCModel, the proposed QNet has smaller quantification errors for simulated data and presents more stable quantification for 20 healthy in vivo data at a wide range of signal-to-noise ratios. QNet also outperforms other end-to-end deep learning methods. Conclusion: This study provides an intelligent, reliable and robust MRS quantification. Significance: QNet is the first LLS quantification aided by deep learning.
Submitted 9 October, 2023; v1 submitted 16 June, 2023;
originally announced June 2023.
-
Robustness of SAM: Segment Anything Under Corruptions and Beyond
Authors:
Yu Qiao,
Chaoning Zhang,
Taegoo Kang,
Donghun Kim,
Chenshuang Zhang,
Choong Seon Hong
Abstract:
Segment anything model (SAM), as the name suggests, is claimed to be capable of cutting out any object and demonstrates impressive zero-shot transfer performance with the guidance of prompts. However, there is currently a lack of comprehensive evaluation regarding its robustness under various corruptions. Understanding the robustness of SAM across different corruption scenarios is crucial for its real-world deployment. Prior works show that SAM is biased towards texture (style) rather than shape; motivated by this, we start by investigating its robustness against style transfer, which is a synthetic corruption. Having interpreted the effects of synthetic corruption as style changes, we proceed to conduct a comprehensive evaluation of its robustness against 15 types of common corruption. These corruptions mainly fall into categories such as digital, noise, weather, and blur, and within each corruption category, we explore 5 severity levels to simulate real-world corruption scenarios. Beyond the corruptions, we further assess the robustness of SAM against local occlusion and local adversarial patch attacks. To the best of our knowledge, our work is the first of its kind to evaluate the robustness of SAM under style change, local occlusion, and local adversarial patch attacks. Given that patch attacks visible to human eyes are easily detectable, we further assess its robustness against global adversarial attacks that are imperceptible to human eyes. Overall, this work provides a comprehensive empirical study of the robustness of SAM, evaluating its performance under various corruptions and extending the assessment to critical aspects such as local occlusion, local adversarial patch attacks, and global adversarial attacks. These evaluations yield valuable insights into the practical applicability and effectiveness of SAM in addressing real-world challenges.
Submitted 4 September, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering
Authors:
Chaoning Zhang,
Joseph Cho,
Fachrina Dewi Puspitasari,
Sheng Zheng,
Chenghao Li,
Yu Qiao,
Taegoo Kang,
Xinru Shan,
Chenshuang Zhang,
Caiyan Qin,
Francois Rameau,
Lik-Hang Lee,
Sung-Ho Bae,
Choong Seon Hong
Abstract:
The Segment Anything Model (SAM), developed by Meta AI Research, represents a significant breakthrough in computer vision, offering a robust framework for image and video segmentation. This survey provides a comprehensive exploration of the SAM family, including SAM and SAM 2, highlighting their advancements in granularity and contextual understanding. Our study demonstrates SAM's versatility across a wide range of applications while identifying areas where improvements are needed, particularly in scenarios requiring high granularity and in the absence of explicit prompts. By mapping the evolution and capabilities of SAM models, we offer insights into their strengths and limitations and suggest future research directions, including domain-specific adaptations and enhanced memory and propagation mechanisms. We believe that this survey comprehensively covers the breadth of SAM's applications and challenges, setting the stage for ongoing advancements in segmentation technology.
Submitted 19 October, 2024; v1 submitted 12 May, 2023;
originally announced June 2023.
-
Gotta Go Fast: Measuring Input/Output Latencies of Virtual Reality 3D Engines for Cognitive Experiments
Authors:
Taeho Kang,
Christian Wallraven
Abstract:
Virtual Reality (VR) is seeing increased adoption across many fields. The field of experimental cognitive science is also testing utilization of the technology combined with physiological measures such as electroencephalography (EEG) and eye tracking. Quantitative measures of human behavior and cognitive processes, however, are sensitive to minuscule time resolutions that are often overlooked in the scope of consumer-level VR hardware and software stacks. In this preliminary study, we implement VR testing environments in two prominent 3D Virtual Reality frameworks (Unity and Unreal Engine) to measure the latency from stimulus onset execution code to Head-Mounted Display (HMD) pixel change, as well as the latency from human behavioral response input to its registration in the engine environment, under a typical cognitive experiment hardware setup. We find that, although the specifics of the latency may be further influenced by different hardware and software setups, the variation in consumer hardware is apparent regardless, and we report detailed statistics on these latencies. Such considerations should be taken into account when designing VR-based cognitive experiments that measure human behavior.
Submitted 5 June, 2023;
originally announced June 2023.
-
Differentially Private Topological Data Analysis
Authors:
Taegyu Kang,
Sehwan Kim,
Jinwon Sohn,
Jordan Awan
Abstract:
This paper is the first to attempt differentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used Čech complex has sensitivity that does not decrease as the sample size $n$ increases. This makes it challenging for the persistence diagrams of Čech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the $L^1$-distance to measure (DTM) has sensitivity $O(1/n)$. Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is defined in terms of the bottleneck distance of the $L^1$-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real dataset tracking human movement.
Submitted 3 November, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
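The exponential mechanism named in the abstract above can be illustrated with a minimal Python sketch. It samples one candidate with probability proportional to exp(epsilon * u / (2 * sensitivity)). The function names, the candidate grid, and the toy scalar utility are assumptions made for illustration only; the paper defines its utility through the bottleneck distance between $L^1$-DTM persistence diagrams, which is not reproduced here. The value `1.0 / n` only mimics the $O(1/n)$ sensitivity scaling stated in the abstract.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    shift = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(scale * (u - shift)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=random):
    """Draw one candidate; higher-utility candidates are exponentially more likely."""
    utilities = [utility(c) for c in candidates]
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Toy usage (hypothetical): privately pick a value close to a non-private target.
n = 100                                   # assumed sample size
target = 0.30                             # assumed non-private statistic
candidates = [i / 10 for i in range(11)]  # assumed output grid
choice = exponential_mechanism(
    candidates,
    utility=lambda c: -abs(c - target),   # toy utility, not the bottleneck distance
    epsilon=1.0,
    sensitivity=1.0 / n,                  # mimics an O(1/n) sensitivity
)
```

A smaller sensitivity sharpens the distribution around high-utility candidates, which is why a sensitivity that shrinks with the sample size matters for accuracy.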
-
Attack-SAM: Towards Attacking Segment Anything Model With Adversarial Examples
Authors:
Chenshuang Zhang,
Chaoning Zhang,
Taegoo Kang,
Donghun Kim,
Sung-Ho Bae,
In So Kweon
Abstract:
Segment Anything Model (SAM) has attracted significant attention recently, due to its impressive performance on various downstream tasks in a zero-shot manner. The computer vision (CV) field might follow the natural language processing (NLP) field in embarking on a path from task-specific vision models toward foundation models. However, deep vision models are widely recognized as vulnerable to adversarial examples, which fool the model into making wrong predictions with imperceptible perturbations. Such vulnerability to adversarial attacks causes serious concerns when applying deep models to security-sensitive applications. Therefore, it is critical to know whether the vision foundation model SAM can also be fooled by adversarial attacks. To the best of our knowledge, our work is the first of its kind to conduct a comprehensive investigation on how to attack SAM with adversarial examples. With the basic attack goal set to mask removal, we investigate the adversarial robustness of SAM in the full white-box setting and transfer-based black-box settings. Beyond the basic goal of mask removal, we further investigate and find that it is possible to generate any desired mask by the adversarial attack.
Submitted 8 May, 2023; v1 submitted 1 May, 2023;
originally announced May 2023.
-
Hold the Suspect! : An Analysis on Media Framing of Itaewon Halloween Crowd Crush
Authors:
TaeYoung Kang
Abstract:
Based on 10.9K articles from the top 40 news providers of South Korea, this paper analyzed the media framing of the Itaewon Halloween Crowd Crush during the first 72 hours after the incident. By adopting word-vector embedding and clustering, we found that conservative media focused on political parties' responses and the suspect's identity, while liberal media covered the responsibility of the government and a possible unequal spillover effect on low-income industry workers. Although the social tragedy was not directly connected to institutional politics, the media clearly exhibited political bias in the coverage process.
Submitted 23 April, 2023;
originally announced April 2023.
-
Non-destructive Fault Diagnosis of Electronic Interconnects by Learning Signal Patterns of Reflection Coefficient in the Frequency Domain
Authors:
Tae Yeob Kang,
Haebom Lee,
Sungho Suh
Abstract:
Fault detection and diagnosis of the interconnects are crucial for prognostics and health management (PHM) of electronics. Traditional methods, which rely on electronic signals as prognostic factors, often struggle to accurately identify the root causes of defects without resorting to destructive testing. Furthermore, these methods are vulnerable to noise interference, which can result in false alarms. To address these limitations, in this paper, we propose a novel, non-destructive approach for early fault detection and accurate diagnosis of interconnect defects, with improved noise resilience. Our approach uniquely utilizes the signal patterns of the reflection coefficient across a range of frequencies, enabling both root cause identification and severity assessment. This approach departs from conventional time-series analysis and effectively transforms the signal data into a format suitable for advanced learning algorithms. Additionally, we introduce a novel severity rating ensemble learning (SREL) approach, which enhances diagnostic accuracy and robustness in noisy environments. Experimental results demonstrate that the proposed method is effective for fault detection and diagnosis and has the potential to extend to real-world industrial applications.
Submitted 4 October, 2024; v1 submitted 20 April, 2023;
originally announced April 2023.
-
A Survey on Graph Diffusion Models: Generative AI in Science for Molecule, Protein and Material
Authors:
Mengchun Zhang,
Maryam Qamar,
Taegoo Kang,
Yuna Jung,
Chenshuang Zhang,
Sung-Ho Bae,
Chaoning Zhang
Abstract:
Diffusion models have become a new SOTA generative modeling method in various fields, and multiple survey works already provide a general overview. With the number of articles on diffusion models increasing exponentially in the past few years, there is an increasing need for surveys of diffusion models in specific fields. In this work, we conduct a survey on graph diffusion models. Even though our focus is to cover the progress of diffusion models in graphs, we first briefly summarize how other generative modeling methods are used for graphs. After that, we introduce the mechanism of diffusion models in various forms, which facilitates the discussion on graph diffusion models. The applications of graph diffusion models mainly fall into the category of AI-generated content (AIGC) in science, for which we mainly focus on how graph diffusion models are utilized for generating molecules and proteins but also cover other cases, including materials design. Moreover, we discuss the issue of evaluating diffusion models in the graph domain and the existing challenges.
Submitted 4 April, 2023;
originally announced April 2023.
-
Kinematic-aware Hierarchical Attention Network for Human Pose Estimation in Videos
Authors:
Kyung-Min Jin,
Byoung-Sung Lim,
Gun-Hee Lee,
Tae-Kyung Kang,
Seong-Whan Lee
Abstract:
Previous video-based human pose estimation methods have shown promising results by leveraging aggregated features of consecutive frames. However, most approaches compromise accuracy to mitigate jitter or do not sufficiently comprehend the temporal aspects of human motion. Furthermore, occlusion increases uncertainty between consecutive frames, which leads to non-smooth results. To address these issues, we design an architecture that exploits the keypoint kinematic features with the following components. First, we effectively capture the temporal features by leveraging each individual keypoint's velocity and acceleration. Second, the proposed hierarchical transformer encoder aggregates spatio-temporal dependencies and refines the 2D or 3D input pose estimated from existing estimators. Finally, we provide an online cross-supervision between the refined input pose generated from the encoder and the final pose from our decoder to enable joint optimization. We demonstrate comprehensive results and validate the effectiveness of our model in various tasks: 2D pose estimation, 3D pose estimation, body mesh recovery, and sparsely annotated multi-human pose estimation. Our code is available at https://github.com/KyungMinJin/HANet.
Submitted 28 November, 2022;
originally announced November 2022.
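The keypoint velocity and acceleration features mentioned in the abstract above can be sketched with simple finite differences over consecutive frames. This is a minimal illustration, not the authors' implementation: the function name `kinematic_features`, the 2D track format, and the per-frame units are assumptions, and the abstract does not state how the features are actually computed.

```python
def kinematic_features(track):
    """Compute per-frame velocity and acceleration of one keypoint.

    track: list of (x, y) positions over consecutive frames (assumed format).
    Velocity is the first-order difference and acceleration the second-order
    difference, both expressed per frame.
    """
    velocity = [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    ]
    acceleration = [
        (vx1 - vx0, vy1 - vy0)
        for (vx0, vy0), (vx1, vy1) in zip(velocity, velocity[1:])
    ]
    return velocity, acceleration


# Hypothetical track of a single keypoint across four frames.
track = [(0.0, 0.0), (1.0, 0.0), (3.0, 1.0), (6.0, 3.0)]
velocity, acceleration = kinematic_features(track)
```

A track of T frames yields T - 1 velocity vectors and T - 2 acceleration vectors, so a real pipeline would need some padding or alignment choice before feeding them to an encoder.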
-
Suffering from Vaccines or from Government? : Partisan Bias in COVID-19 Vaccine Adverse Events Coverage
Authors:
TaeYoung Kang,
Hanbin Lee
Abstract:
Vaccine adverse events have been presumed to be a relatively objective measure that is immune to political polarization. Real-world data, however, show a correlation between presidential disapproval ratings and the subjective severity of adverse events. This paper investigates the partisan bias in COVID vaccine adverse events coverage with language models that can classify the topic of vaccine-related articles and the political disposition of news comments. Based on 90K news articles from 52 major newspaper companies, we found that conservative media are inclined to report adverse events more frequently than their liberal counterparts, while the coverage itself was statistically uncorrelated with the severity of real-world adverse events. Users who support the conservative opposition party were more likely to write the popular comments from 2.3K randomly sampled articles on news platforms. This research implies that bipartisanship can still play a significant role in forming public opinion on the COVID vaccine even after the majority of the population has been vaccinated.
Submitted 19 November, 2022;
originally announced November 2022.
-
Physics-informed Deep Diffusion MRI Reconstruction with Synthetic Data: Break Training Data Bottleneck in Artificial Intelligence
Authors:
Chen Qian,
Haoyu Zhang,
Yuncheng Gao,
Mingyang Han,
Zi Wang,
Dan Ruan,
Yu Shen,
Yaping Wu,
Yirong Zhou,
Chengyan Wang,
Boyu Jiang,
Ran Tao,
Zhigang Wu,
Jiazheng Wang,
Liuhong Zhu,
Yi Guo,
Taishan Kang,
Jianzhong Lin,
Tao Gong,
Chen Yang,
Guoqiang Fei,
Meijin Lin,
Di Guo,
Jianjun Zhou,
Meiyun Wang
, et al. (1 additional author not shown)
Abstract:
Diffusion magnetic resonance imaging (MRI) is the only imaging modality for non-invasive movement detection of in vivo water molecules, with significant clinical and research applications. Diffusion weighted imaging (DWI) MRI acquired by multi-shot techniques can achieve higher resolution, better signal-to-noise ratio, and lower geometric distortion than single-shot, but suffers from inter-shot motion-induced artifacts. These artifacts cannot be removed prospectively, leading to the absence of artifact-free training labels. Thus, the potential of deep learning in multi-shot DWI reconstruction remains largely untapped. To break the training data bottleneck, here, we propose a Physics-Informed Deep DWI reconstruction method (PIDD) to synthesize high-quality paired training data by leveraging the physical diffusion model (magnitude synthesis) and inter-shot motion-induced phase model (motion phase synthesis). The network is trained only once with 100,000 synthetic samples, achieving encouraging results on multiple realistic in vivo data reconstructions. Advantages over conventional methods include: (a) Better motion artifact suppression and reconstruction stability; (b) Outstanding generalization to multi-scenario reconstructions, including multi-resolution, multi-b-value, multi-under-sampling, multi-vendor, and multi-center; (c) Excellent clinical adaptability to patients with verifications by seven experienced doctors (p<0.001). In conclusion, PIDD presents a novel deep learning framework by exploiting the power of MRI physics, providing a cost-effective and explainable way to break the data bottleneck in deep learning medical imaging.
Submitted 3 May, 2025; v1 submitted 20 October, 2022;
originally announced October 2022.
-
Bosonic Qiskit
Authors:
Timothy J Stavenger,
Eleanor Crane,
Kevin Smith,
Christopher T Kang,
Steven M Girvin,
Nathan Wiebe
Abstract:
The practical benefits of hybrid quantum information processing hardware that contains continuous-variable objects (bosonic modes such as mechanical or electromagnetic oscillators) in addition to traditional (discrete-variable) qubits have recently been demonstrated by experiments with bosonic codes that reach the break-even point for quantum error correction and by efficient Gaussian boson sampling simulation of the Franck-Condon spectra of triatomic molecules that is well beyond the capabilities of current qubit-only hardware. The goal of this Co-design Center for Quantum Advantage (C2QA) project is to develop an instruction set architecture (ISA) for hybrid qubit/bosonic mode systems that contains an inventory of the fundamental operations and measurements that are possible in such hardware. The corresponding abstract machine model (AMM) would also contain a description of the appropriate error models associated with the gates, measurements and time evolution of the hardware. This information has been implemented as an extension of Qiskit. Qiskit is an open-source software development toolkit (SDK) for simulating the quantum state of a quantum circuit on a system with Python 3.7+ and for running the same circuits on prototype hardware within the IBM Quantum Lab. We introduce the Bosonic Qiskit software to enable the simulation of hybrid qubit/bosonic systems using the existing Qiskit software development kit. This implementation can be used for simulating new hybrid systems, verifying proposed physical systems, and modeling systems larger than can currently be constructed. We also cover tutorials and example use cases included within the software to study Jaynes-Cummings models, bosonic Hubbard models, plotting Wigner functions and animations, and calculating maximum likelihood estimations using Wigner functions.
Submitted 2 December, 2022; v1 submitted 22 September, 2022;
originally announced September 2022.
-
HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers
Authors:
Tae-Kyung Kang,
Gun-Hee Lee,
Seong-Whan Lee
Abstract:
Temporal action localization (TAL) is the task of identifying a set of actions in a video, which involves localizing the start and end frames and classifying each action instance. Existing methods have addressed this task by using predefined anchor windows or heuristic bottom-up boundary-matching strategies, which are major bottlenecks in inference time. Additionally, the main challenge is the inability to capture long-range actions due to a lack of global contextual information. In this paper, we present a novel anchor-free framework, referred to as HTNet, which predicts a set of <start time, end time, class> triplets from a video based on a Transformer architecture. After the prediction of coarse boundaries, we refine them through a background feature sampling (BFS) module and hierarchical Transformers, which enables our model to aggregate global contextual information and effectively exploit the inherent semantic relationships in a video. We demonstrate how our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets: THUMOS14 and ActivityNet 1.3.
Submitted 20 July, 2022; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Korean Online Hate Speech Dataset for Multilabel Classification: How Can Social Science Improve Dataset on Hate Speech?
Authors:
TaeYoung Kang,
Eunrang Kwon,
Junbum Lee,
Youngeun Nam,
Junmo Song,
JeongKyu Suh
Abstract:
We suggest a multilabel Korean online hate speech dataset that covers seven categories of hate speech: (1) Race and Nationality, (2) Religion, (3) Regionalism, (4) Ageism, (5) Misogyny, (6) Sexual Minorities, and (7) Male. Our 35K dataset consists of 24K online comments with Krippendorff's Alpha label accordance of .713, 2.2K neutral sentences from Wikipedia, 1.7K additionally labeled sentences generated by the Human-in-the-Loop procedure and rule-generated 7.1K neutral sentences. The base model with 24K initial dataset achieved the accuracy of LRAP .892, but improved to .919 after being combined with 11K additional data. Unlike the conventional binary hate and non-hate dichotomy approach, we designed a dataset considering both the cultural and linguistic context to overcome the limitations of western culture-based English texts. Thus, this paper is not only limited to presenting a local hate speech dataset but extends as a manual for building a more generalized hate speech dataset with diverse cultural backgrounds based on social science perspectives.
Submitted 8 April, 2022; v1 submitted 7 April, 2022;
originally announced April 2022.
-
MIDAS: Multi-sensorial Immersive Dynamic Autonomous System Improves Motivation of Stroke Affected Patients for Hand Rehabilitation
Authors:
Fok-Chi-Seng Fok Kow,
Anoop Kumar Sinha,
Zhang Jin Ming,
Bao Songyu,
Jake Tan Jun Kang,
Hong Yan Jack Jeffrey,
Galina Mihaleva,
Nadia Magnenat Thalmann,
Yiyu Cai
Abstract:
The majority of stroke survivors are left with poorly functioning paretic hands. Current rehabilitation devices have failed to motivate the patients enough to continue rehabilitation exercises. The objective of this project, MIDAS (Multi-sensorial Immersive Dynamic Autonomous System), is to provide a proof of concept for using an immersive system to improve the motivation of stroke patients for hand rehabilitation. MIDAS is intended for stroke patients who suffer from light to mild stroke. MIDAS is lightweight and portable. It consists of a hand exoskeleton subsystem, a Virtual Reality (VR) subsystem, and an olfactory subsystem. Altogether, MIDAS engages four out of five senses during rehabilitation. To evaluate the efficacy of MIDAS, a pilot study consisting of three sessions is carried out on five stroke-affected patients. Subsystems of MIDAS are added progressively in each session. The game environment, sonic effects, and scent released are carefully chosen to enhance the immersive experience. 60% of the user experience scores are above 40 (out of 56). A 96% Self Rehabilitation Motivation Scale (SRMS) rating shows that the participants are motivated to use MIDAS, and an 87% rating shows that MIDAS is exciting for rehabilitation. Participants experienced elevated motivation to continue stroke rehabilitation using MIDAS, and no undesired side effects were reported.
Submitted 20 March, 2022;
originally announced March 2022.
-
Optimization of a Real-Time Wavelet-Based Algorithm for Improving Speech Intelligibility
Authors:
Tianqu Kang,
Anh-Dung Dinh,
Binghong Wang,
Tianyuan Du,
Yijia Chen,
Kevin Chau
Abstract:
The optimization of a wavelet-based algorithm to improve speech intelligibility along with the full data set and results are reported. The discrete-time speech signal is split into frequency sub-bands via a multi-level discrete wavelet transform. Various gains are applied to the sub-band signals before they are recombined to form a modified version of the speech. The sub-band gains are adjusted while keeping the overall signal energy unchanged, and the speech intelligibility under various background interference and simulated hearing loss conditions is enhanced and evaluated objectively and quantitatively using Google Speech-to-Text transcription. A universal set of sub-band gains can work over a range of noise-to-signal ratios up to 4.8 dB. For noise-free speech, overall intelligibility is improved, and the Google transcription accuracy is increased by 16.9 percentage points on average and 86.7 maximum by reallocating the spectral energy toward the mid-frequency sub-bands. For speech already corrupted by noise, improving intelligibility is challenging but still realizable with an increased transcription accuracy of 9.5 percentage points on average and 71.4 maximum. The proposed algorithm is implementable for real-time speech processing and comparatively simpler than previous algorithms. Potential applications include speech enhancement, hearing aids, machine listening, and a better understanding of speech intelligibility.
Submitted 21 July, 2022; v1 submitted 5 February, 2022;
originally announced February 2022.
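The sub-band processing described in the abstract above (split by a multi-level discrete wavelet transform, apply per-band gains, recombine, keep overall energy unchanged) can be sketched as follows. This is a minimal illustration under stated assumptions: the Haar wavelet, the function names, the gain values, and the use of a single global rescale to restore the energy are all choices made here, since the abstract does not name the wavelet or the normalization procedure.

```python
import math

_S = 1.0 / math.sqrt(2.0)


def haar_dwt(signal):
    """One level of the orthonormal Haar transform (assumed wavelet)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * _S for a, b in pairs]
    detail = [(a - b) * _S for a, b in pairs]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * _S, (a - d) * _S])
    return out


def reweight_subbands(signal, gains, levels):
    """Apply one gain per sub-band, then rescale to the original energy.

    gains: [approximation, coarsest detail, ..., finest detail],
           i.e. levels + 1 values ordered from low to high frequency.
    The signal length must be divisible by 2 ** levels.
    """
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)  # finest band first

    approx = [gains[0] * a for a in approx]
    # Reverse so bands run coarsest to finest, matching gains[1:].
    scaled = [[g * c for c in d] for g, d in zip(gains[1:], reversed(details))]
    for detail in scaled:
        approx = haar_idwt(approx, detail)

    energy_in = sum(x * x for x in signal)
    energy_out = sum(x * x for x in approx)
    scale = math.sqrt(energy_in / energy_out) if energy_out > 0 else 1.0
    return [scale * x for x in approx]


# Hypothetical gains that shift energy toward the middle band.
speech = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
modified = reweight_subbands(speech, gains=[0.5, 2.0, 1.0], levels=2)
```

Because the final rescale is global, only the ratios between the gains matter; multiplying all gains by the same constant leaves the output unchanged.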
-
Data Augmentation using Random Image Cropping for High-resolution Virtual Try-On (VITON-CROP)
Authors:
Taewon Kang,
Sunghyun Park,
Seunghwan Choi,
Jaegul Choo
Abstract:
Image-based virtual try-on provides the capacity to transfer a clothing item onto a photo of a given person, which is usually accomplished by warping the item to a given human pose and adjusting the warped item to the person. However, the real-world synthetic images (e.g., selfies) produced by the previous method are not realistic because of limitations that result in the neck being misrepresented and significant changes to the style of the garment. To address these challenges, we propose a novel method to solve this issue, called VITON-CROP. VITON-CROP synthesizes images more robustly when integrated with random crop augmentation compared to the existing state-of-the-art virtual try-on models. In the experiments, we demonstrate that VITON-CROP is superior to VITON-HD both qualitatively and quantitatively.
Submitted 16 November, 2021;
originally announced November 2021.
-
Indoor Navigation Algorithm Based on a Smartphone Inertial Measurement Unit and Map Matching
Authors:
Taewon Kang,
Younghoon Shin
Abstract:
We propose an indoor navigation algorithm based on pedestrian dead reckoning (PDR) using an inertial measurement unit in a smartphone and map matching. The proposed indoor navigation system is user-friendly and convenient because it requires no additional device except a smartphone and works with a pedestrian in a casual posture who is walking with a smartphone in their hand. Because the performance of the PDR decreases over time, we greatly reduced the position error of the trajectory estimated by PDR using a map matching method with a known indoor map. To verify the proposed indoor navigation algorithm, we conducted an experiment in a real indoor environment using a commercial Android smartphone. The performance of our algorithm was demonstrated through the results of the experiment.
Submitted 23 September, 2021;
originally announced September 2021.
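The combination of pedestrian dead reckoning and map matching described in the abstract above can be illustrated with a small sketch: each detected step advances the position along the current heading, and the drifting estimate is then snapped onto a known corridor segment. The function names, the heading convention (radians, counterclockwise from the x-axis), the fixed step length, and the single-segment map are assumptions for illustration; the abstract does not describe the authors' step detection or matching rules.

```python
import math


def pdr_step(position, step_length, heading_rad):
    """Advance a 2D position by one step along the given heading."""
    x, y = position
    return (
        x + step_length * math.cos(heading_rad),
        y + step_length * math.sin(heading_rad),
    )


def snap_to_segment(point, seg_start, seg_end):
    """Project a point onto a corridor segment (a very simple map match)."""
    (px, py), (ax, ay), (bx, by) = point, seg_start, seg_end
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return seg_start
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # stay within the segment
    return (ax + t * dx, ay + t * dy)


# Hypothetical walk along a corridor on the x-axis with a small heading bias.
corridor = ((0.0, 0.0), (10.0, 0.0))
position = (0.0, 0.0)
for _ in range(5):
    position = pdr_step(position, step_length=0.7, heading_rad=0.05)
    position = snap_to_segment(position, *corridor)
```

Snapping after every step keeps the lateral error from accumulating, which mirrors the abstract's point that PDR alone degrades over time.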
-
GeoT: A Geometry-aware Transformer for Reliable Molecular Property Prediction and Chemically Interpretable Representation Learning
Authors:
Bumju Kwak,
Jiwon Park,
Taewon Kang,
Jeonghee Jo,
Byunghan Lee,
Sungroh Yoon
Abstract:
In recent years, molecular representation learning has emerged as a key area of focus in various chemical tasks. However, many existing models fail to fully consider the geometric information of molecular structures, resulting in less intuitive representations. Moreover, the widely used message-passing mechanism is limited in its ability to provide interpretations of experimental results from a chemical perspective. To address these challenges, we introduce a novel Transformer-based framework for molecular representation learning, named the Geometry-aware Transformer (GeoT). GeoT learns molecular graph structures through attention-based mechanisms specifically designed to offer reliable interpretability, as well as molecular property prediction. Consequently, GeoT can generate attention maps of interatomic relationships associated with training objectives. In addition, GeoT demonstrates comparable performance to MPNN-based models while achieving reduced computational complexity. Our comprehensive experiments, including an empirical simulation, reveal that GeoT effectively learns chemical insights into molecular structures, bridging the gap between artificial intelligence and molecular sciences.
Submitted 28 June, 2023; v1 submitted 29 June, 2021;
originally announced June 2021.
-
Multiple GAN Inversion for Exemplar-based Image-to-Image Translation
Authors:
Taewon Kang
Abstract:
Existing state-of-the-art techniques in exemplar-based image-to-image translation raise several critical concerns. Existing methods cannot translate an image tuple input (source, target) that is not aligned. Additionally, we confirm that the existing methods exhibit limited generalization ability to unseen images. In order to overcome these limitations, we propose Multiple GAN Inversion for Exemplar-based Image-to-Image Translation. Our novel Multiple GAN Inversion avoids human intervention by using a self-deciding algorithm to choose the number of layers using Fréchet Inception Distance (FID), which selects more plausible image reconstruction results among multiple hypotheses without any training or supervision. Experimental results have shown the advantage of the proposed method compared to existing state-of-the-art exemplar-based image-to-image translation methods.
Submitted 19 August, 2021; v1 submitted 26 March, 2021;
originally announced March 2021.
-
Online Exemplar Fine-Tuning for Image-to-Image Translation
Authors:
Taewon Kang,
Soohyun Kim,
Sunwoo Kim,
Seungryong Kim
Abstract:
Existing techniques to solve exemplar-based image-to-image translation within deep convolutional neural networks (CNNs) generally require a training phase to optimize the network parameters on domain-specific and task-specific benchmarks, thus having limited applicability and generalization ability. In this paper, we propose a novel framework, for the first time, to solve exemplar-based translation through an online optimization given an input image pair, called online exemplar fine-tuning (OEFT), in which we fine-tune off-the-shelf, general-purpose networks on the input image pair itself. We design two sub-networks, namely correspondence fine-tuning and multiple GAN inversion, and optimize these network parameters and latent codes, starting from the pre-trained ones, with well-defined loss functions. Our framework does not require the off-line training phase, which has been the main challenge of existing methods, but only the pre-trained networks to enable online optimization. Experimental results prove that our framework generalizes effectively to unseen image pairs and clearly outperforms even state-of-the-art methods that need an intensive training phase.
Submitted 18 November, 2020;
originally announced November 2020.