-
PicPersona-TOD : A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona
Authors:
Jihyun Lee,
Yejin Jeon,
Seungyeon Seo,
Gary Geunbae Lee
Abstract:
Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses t…
▽ More
Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses, but also demonstrates robust performance across unseen domains https://github.com/JihyunLee1/PicPersona.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Dynamic Contrastive Skill Learning with State-Transition Based Skill Clustering and Dynamic Length Adjustment
Authors:
Jinwoo Choi,
Seung-Woo Seo
Abstract:
Reinforcement learning (RL) has made significant progress in various domains, but scaling it to long-horizon tasks with complex decision-making remains challenging. Skill learning attempts to address this by abstracting actions into higher-level behaviors. However, current approaches often fail to recognize semantically similar behaviors as the same skill and use fixed skill lengths, limiting flex…
▽ More
Reinforcement learning (RL) has made significant progress in various domains, but scaling it to long-horizon tasks with complex decision-making remains challenging. Skill learning attempts to address this by abstracting actions into higher-level behaviors. However, current approaches often fail to recognize semantically similar behaviors as the same skill and use fixed skill lengths, limiting flexibility and generalization. To address this, we propose Dynamic Contrastive Skill Learning (DCSL), a novel framework that redefines skill representation and learning. DCSL introduces three key ideas: state-transition based skill representation, skill similarity function learning, and dynamic skill length adjustment. By focusing on state transitions and leveraging contrastive learning, DCSL effectively captures the semantic context of behaviors and adapts skill lengths to match the appropriate temporal extent of behaviors. Our approach enables more flexible and adaptive skill extraction, particularly in complex or noisy datasets, and demonstrates competitive performance compared to existing methods in task completion and efficiency.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Progress on Self Identifying Codes
Authors:
Devin Jean,
Suk Seo
Abstract:
The concept of an identifying code for a graph was introduced by Karpovsky, Chakrabarty, and Levitin in 1998 as the problem of covering the vertices of a graph such that we can uniquely identify any vertex in the graph by examining the vertices that cover it. An application of an identifying code would be to detect a faulty processor in a multiprocessor system. In 2020, a variation of identify cod…
▽ More
The concept of an identifying code for a graph was introduced by Karpovsky, Chakrabarty, and Levitin in 1998 as the problem of covering the vertices of a graph such that we can uniquely identify any vertex in the graph by examining the vertices that cover it. An application of an identifying code would be to detect a faulty processor in a multiprocessor system. In 2020, a variation of identify code called "self-identifying code" was introduced by Junnila and Laihonen, which simplifies the task of locating the malfunctioning processor. In this paper, we continue to explore self-identifying codes. In particular, we prove the problem of determining the minimum cardinality of a self-identifying code for an arbitrary graph is NP-complete and we investigate minimum-sized self-identifying code in several classes of graphs, including cubic graphs and infinite grids.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Compositional Flows for 3D Molecule and Synthesis Pathway Co-design
Authors:
Tony Shen,
Seonghwan Seo,
Ross Irwin,
Kieran Didi,
Simon Olsson,
Woo Youn Kim,
Martin Ester
Abstract:
Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be fo…
▽ More
Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule's synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity on all 15 targets from the LIT-PCBA benchmark, and 5.8$\times$ improvement in sampling efficiency compared to 2D synthesis-based baseline. To our best knowledge, our method is also the first to achieve state of-art-performance in both Vina Dock (-9.38) and AiZynth success rate (62.2\%) on the CrossDocked benchmark.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
LaMOuR: Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning
Authors:
Chan Kim,
Seung-Woo Seo,
Seong-Woo Kim
Abstract:
Deep Reinforcement Learning (DRL) has demonstrated strong performance in robotic control but remains susceptible to out-of-distribution (OOD) states, often resulting in unreliable actions and task failure. While previous methods have focused on minimizing or preventing OOD occurrences, they largely neglect recovery once an agent encounters such states. Although the latest research has attempted to…
▽ More
Deep Reinforcement Learning (DRL) has demonstrated strong performance in robotic control but remains susceptible to out-of-distribution (OOD) states, often resulting in unreliable actions and task failure. While previous methods have focused on minimizing or preventing OOD occurrences, they largely neglect recovery once an agent encounters such states. Although the latest research has attempted to address this by guiding agents back to in-distribution states, their reliance on uncertainty estimation hinders scalability in complex environments. To overcome this limitation, we introduce Language Models for Out-of-Distribution Recovery (LaMOuR), which enables recovery learning without relying on uncertainty estimation. LaMOuR generates dense reward codes that guide the agent back to a state where it can successfully perform its original task, leveraging the capabilities of LVLMs in image description, logical reasoning, and code generation. Experimental results show that LaMOuR substantially enhances recovery efficiency across diverse locomotion tasks and even generalizes effectively to complex environments, including humanoid locomotion and mobile manipulation, where existing methods struggle. The code and supplementary materials are available at https://lamour-rl.github.io/.
△ Less
Submitted 28 March, 2025; v1 submitted 21 March, 2025;
originally announced March 2025.
-
DivCon-NeRF: Generating Augmented Rays with Diversity and Consistency for Few-shot View Synthesis
Authors:
Ingyun Lee,
Jae Won Jang,
Seunghyeon Seo,
Nojun Kwak
Abstract:
Neural Radiance Field (NeRF) has shown remarkable performance in novel view synthesis but requires many multiview images, making it impractical for few-shot scenarios. Ray augmentation was proposed to prevent overfitting for sparse training data by generating additional rays. However, existing methods, which generate augmented rays only near the original rays, produce severe floaters and appearanc…
▽ More
Neural Radiance Field (NeRF) has shown remarkable performance in novel view synthesis but requires many multiview images, making it impractical for few-shot scenarios. Ray augmentation was proposed to prevent overfitting for sparse training data by generating additional rays. However, existing methods, which generate augmented rays only near the original rays, produce severe floaters and appearance distortion due to limited viewpoints and inconsistent rays obstructed by nearby obstacles and complex surfaces. To address these problems, we propose DivCon-NeRF, which significantly enhances both diversity and consistency. It employs surface-sphere augmentation, which preserves the distance between the original camera and the predicted surface point. This allows the model to compare the order of high-probability surface points and filter out inconsistent rays easily without requiring the exact depth. By introducing inner-sphere augmentation, DivCon-NeRF randomizes angles and distances for diverse viewpoints, further increasing diversity. Consequently, our method significantly reduces floaters and visual distortions, achieving state-of-the-art performance on the Blender, LLFF, and DTU datasets. Our code will be publicly available.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
ROODI: Reconstructing Occluded Objects with Denoising Inpainters
Authors:
Yeonjin Chang,
Erqun Dong,
Seunghyeon Seo,
Nojun Kwak,
Kwang Moo Yi
Abstract:
While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remain far from being solved. We propose a novel object extraction method based on two key principles: (1) being object-centric by pruning irrelevant pr…
▽ More
While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remain far from being solved. We propose a novel object extraction method based on two key principles: (1) being object-centric by pruning irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we analyze the local structure of primitives using K-nearest neighbors, and retain only relevant ones. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
LLMDR: LLM-Driven Deadlock Detection and Resolution in Multi-Agent Pathfinding
Authors:
Seungbae Seo,
Junghwan Kim,
Minjeong Shin,
Bongwon Suh
Abstract:
Multi-Agent Pathfinding (MAPF) is a core challenge in multi-agent systems. Existing learning-based MAPF methods often struggle with scalability, particularly when addressing complex scenarios that are prone to deadlocks. To address these challenges, we introduce LLMDR (LLM-Driven Deadlock Detection and Resolution), an approach designed to resolve deadlocks and improve the performance of learnt MAP…
▽ More
Multi-Agent Pathfinding (MAPF) is a core challenge in multi-agent systems. Existing learning-based MAPF methods often struggle with scalability, particularly when addressing complex scenarios that are prone to deadlocks. To address these challenges, we introduce LLMDR (LLM-Driven Deadlock Detection and Resolution), an approach designed to resolve deadlocks and improve the performance of learnt MAPF models. LLMDR integrates the inference capabilities of large language models (LLMs) with learnt MAPF models and prioritized planning, enabling it to detect deadlocks and provide customized resolution strategies. We evaluate LLMDR on standard MAPF benchmark maps with varying agent numbers, measuring its performance when combined with several base models. The results demonstrate that LLMDR improves the performance of learnt MAPF models, particularly in deadlock-prone scenarios, with notable improvements in success rates. These findings show the potential of integrating LLMs to improve the scalability of learning-based MAPF methods.
The source code for LLMDR is available at: https://github.com/ssbacc/llmdr-dhc
△ Less
Submitted 1 March, 2025;
originally announced March 2025.
-
Socratic: Enhancing Human Teamwork via AI-enabled Coaching
Authors:
Sangwon Seo,
Bing Han,
Rayan E. Harari,
Roger D. Dias,
Marco A. Zenati,
Eduardo Salas,
Vaibhav Unhelkar
Abstract:
Coaches are vital for effective collaboration, but cost and resource constraints often limit their availability during real-world tasks. This limitation poses serious challenges in life-critical domains that rely on effective teamwork, such as healthcare and disaster response. To address this gap, we propose and realize an innovative application of AI: task-time team coaching. Specifically, we int…
▽ More
Coaches are vital for effective collaboration, but cost and resource constraints often limit their availability during real-world tasks. This limitation poses serious challenges in life-critical domains that rely on effective teamwork, such as healthcare and disaster response. To address this gap, we propose and realize an innovative application of AI: task-time team coaching. Specifically, we introduce Socratic, a novel AI system that complements human coaches by providing real-time guidance during task execution. Socratic monitors team behavior, detects misalignments in team members' shared understanding, and delivers automated interventions to improve team performance. We validated Socratic through two human subject experiments involving dyadic collaboration. The results demonstrate that the system significantly enhances team performance with minimal interventions. Participants also perceived Socratic as helpful and trustworthy, supporting its potential for adoption. Our findings also suggest promising directions both for AI research and its practical applications to enhance human teamwork.
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
Hierarchical Imitation Learning of Team Behavior from Heterogeneous Demonstrations
Authors:
Sangwon Seo,
Vaibhav Unhelkar
Abstract:
Successful collaboration requires team members to stay aligned, especially in complex sequential tasks. Team members must dynamically coordinate which subtasks to perform and in what order. However, real-world constraints like partial observability and limited communication bandwidth often lead to suboptimal collaboration. Even among expert teams, the same task can be executed in multiple ways. To…
▽ More
Successful collaboration requires team members to stay aligned, especially in complex sequential tasks. Team members must dynamically coordinate which subtasks to perform and in what order. However, real-world constraints like partial observability and limited communication bandwidth often lead to suboptimal collaboration. Even among expert teams, the same task can be executed in multiple ways. To develop multi-agent systems and human-AI teams for such tasks, we are interested in data-driven learning of multimodal team behaviors. Multi-Agent Imitation Learning (MAIL) provides a promising framework for data-driven learning of team behavior from demonstrations, but existing methods struggle with heterogeneous demonstrations, as they assume that all demonstrations originate from a single team policy. Hence, in this work, we introduce DTIL: a hierarchical MAIL algorithm designed to learn multimodal team behaviors in complex sequential tasks. DTIL represents each team member with a hierarchical policy and learns these policies from heterogeneous team demonstrations in a factored manner. By employing a distribution-matching approach, DTIL mitigates compounding errors and scales effectively to long horizons and continuous state representations. Experimental results show that DTIL outperforms MAIL baselines and accurately models team behavior across a variety of collaborative scenarios.
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
Fine-Grained Video Captioning through Scene Graph Consolidation
Authors:
Sanghyeok Chu,
Seonguk Seo,
Bohyung Han
Abstract:
Recent advances in visual language models (VLMs) have significantly improved image captioning, but extending these gains to video understanding remains challenging due to the scarcity of fine-grained video captioning datasets. To bridge this gap, we propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for cap…
▽ More
Recent advances in visual language models (VLMs) have significantly improved image captioning, but extending these gains to video understanding remains challenging due to the scarcity of fine-grained video captioning datasets. To bridge this gap, we propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation. Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions. To achieve this, we leverage a lightweight graph-to-text model trained solely on text corpora, eliminating the need for video captioning annotations. Experiments on the MSR-VTT and ActivityNet Captions datasets show that our approach outperforms zero-shot video captioning baselines, demonstrating that aggregating frame-level scene graphs yields rich video understanding without requiring large-scale paired data or high inference cost.
△ Less
Submitted 22 February, 2025;
originally announced February 2025.
-
Delay Analysis of 5G HARQ in the Presence of Decoding and Feedback Latencies
Authors:
Vishnu N Moothedath,
Sangwon Seo,
Neda Petreska,
Bernhard Kloiber,
James Gross
Abstract:
The growing demand for stringent quality of service (QoS) guarantees in 5G networks requires accurate characterisation of delay performance, often measured using Delay Violation Probability (DVP) for a given target delay. Widely used retransmission schemes like Automatic Repeat reQuest (ARQ) and Hybrid ARQ (HARQ) improve QoS through effective feedback, incremental redundancy (IR), and parallel ret…
▽ More
The growing demand for stringent quality of service (QoS) guarantees in 5G networks requires accurate characterisation of delay performance, often measured using Delay Violation Probability (DVP) for a given target delay. Widely used retransmission schemes like Automatic Repeat reQuest (ARQ) and Hybrid ARQ (HARQ) improve QoS through effective feedback, incremental redundancy (IR), and parallel retransmission processes. However, existing works to quantify the DVP under these retransmission schemes overlook practical aspects such as decoding complexity, feedback delays, and the resulting need for multiple parallel ARQ/HARQ processes that enable packet transmissions without waiting for previous feedback, thus exploiting valuable transmission opportunities. This work proposes a comprehensive multi-server delay model for ARQ/HARQ that incorporates these aspects. Using a finite blocklength error model, we derive closed-form expressions and algorithms for accurate DVP evaluation under realistic 5G configurations aligned with 3GPP standards. Our numerical evaluations demonstrate notable improvements in DVP accuracy over the state-of-the-art, highlight the impact of parameter tuning and resource allocation, and reveal how DVP affects system throughput.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Color Universal Design Neural Network for the Color Vision Deficiencies
Authors:
Sunyong Seo,
Jinho Park
Abstract:
Information regarding images should be visually understood by anyone, including those with color deficiency. However, such information is not recognizable if the color that seems to be distorted to the color deficiencies meets an adjacent object. The aim of this paper is to propose a color universal design network, called CUD-Net, that generates images that are visually understandable by individua…
▽ More
Information regarding images should be visually understood by anyone, including those with color deficiency. However, such information is not recognizable if the color that seems to be distorted to the color deficiencies meets an adjacent object. The aim of this paper is to propose a color universal design network, called CUD-Net, that generates images that are visually understandable by individuals with color deficiency. CUD-Net is a convolutional deep neural network that can preserve color and distinguish colors for input images by regressing the node point of a piecewise linear function and using a specific filter for each image. To generate CUD images for color deficiencies, we follow a four-step process. First, we refine the CUD dataset based on specific criteria by color experts. Second, we expand the input image information through pre-processing that is specialized for color deficiency vision. Third, we employ a multi-modality fusion architecture to combine features and process the expanded images. Finally, we propose a conjugate loss function based on the composition of the predicted image through the model to address one-to-many problems that arise from the dataset. Our approach is able to produce high-quality CUD images that maintain color and contrast stability. The code for CUD-Net is available on the GitHub repository
△ Less
Submitted 11 February, 2025;
originally announced February 2025.
-
MultiFloodSynth: Multi-Annotated Flood Synthetic Dataset Generation
Authors:
YoonJe Kang,
Yonghoon Jung,
Wonseop Shin,
Bumsoo Kim,
Sanghyun Seo
Abstract:
In this paper, we present synthetic data generation framework for flood hazard detection system. For high fidelity and quality, we characterize several real-world properties into virtual world and simulate the flood situation by controlling them. For the sake of efficiency, recent generative models in image-to-3D and urban city synthesis are leveraged to easily composite flood environments so that…
▽ More
In this paper, we present synthetic data generation framework for flood hazard detection system. For high fidelity and quality, we characterize several real-world properties into virtual world and simulate the flood situation by controlling them. For the sake of efficiency, recent generative models in image-to-3D and urban city synthesis are leveraged to easily composite flood environments so that we avoid data bias due to the hand-crafted manner. Based on our framework, we build the flood synthetic dataset with 5 levels, dubbed MultiFloodSynth which contains rich annotation types like normal map, segmentation, 3D bounding box for a variety of downstream task. In experiments, our dataset demonstrate the enhanced performance of flood hazard detection with on-par realism compared with real dataset.
△ Less
Submitted 13 February, 2025; v1 submitted 6 February, 2025;
originally announced February 2025.
-
Full-scale Representation Guided Network for Retinal Vessel Segmentation
Authors:
Sunyong Seo,
Huisu Yoon,
Semin Kim,
Jongha Lee
Abstract:
The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full Scale Guided Network (FSG-Net), where the feature representation network with modernized convolution blocks extracts full-scale information and the guided convolution block refines that information. Attention-guided filter is introdu…
▽ More
The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full Scale Guided Network (FSG-Net), where the feature representation network with modernized convolution blocks extracts full-scale information and the guided convolution block refines that information. Attention-guided filter is introduced to the guided convolution block under the interpretation that the filter behaves like the unsharp mask filter. Passing full-scale information to the attention block allows for the generation of improved attention maps, which are then passed to the attention-guided filter, resulting in performance enhancement of the segmentation network. The structure preceding the guided convolution block can be replaced by any U-Net variant, which enhances the scalability of the proposed approach. For a fair comparison, we re-implemented recent studies available in public repositories to evaluate their scalability and reproducibility. Our experiments also show that the proposed network demonstrates competitive results compared to current SOTA models on various public datasets. Ablation studies demonstrate that the proposed model is competitive with much smaller parameter sizes. Lastly, by applying the proposed model to facial wrinkle segmentation, we confirmed the potential for scalability to similar tasks in other domains. Our code is available on https://github.com/ZombaSY/FSG-Net-pytorch.
△ Less
Submitted 31 January, 2025;
originally announced January 2025.
-
DIAL: Distribution-Informed Adaptive Learning of Multi-Task Constraints for Safety-Critical Systems
Authors:
Se-Wook Yoo,
Seung-Woo Seo
Abstract:
Safe reinforcement learning has traditionally relied on predefined constraint functions to ensure safety in complex real-world tasks, such as autonomous driving. However, defining these functions accurately for varied tasks is a persistent challenge. Recent research highlights the potential of leveraging pre-acquired task-agnostic knowledge to enhance both safety and sample efficiency in related t…
▽ More
Safe reinforcement learning has traditionally relied on predefined constraint functions to ensure safety in complex real-world tasks, such as autonomous driving. However, defining these functions accurately for varied tasks is a persistent challenge. Recent research highlights the potential of leveraging pre-acquired task-agnostic knowledge to enhance both safety and sample efficiency in related tasks. Building on this insight, we propose a novel method to learn shared constraint distributions across multiple tasks. Our approach identifies the shared constraints through imitation learning and then adapts to new tasks by adjusting risk levels within these learned distributions. This adaptability addresses variations in risk sensitivity stemming from expert-specific biases, ensuring consistent adherence to general safety principles even with imperfect demonstrations. Our method can be applied to control and navigation domains, including multi-task and meta-task scenarios, accommodating constraints such as maintaining safe distances or adhering to speed limits. Experimental results validate the efficacy of our approach, demonstrating superior safety performance and success rates compared to baselines, all without requiring task-specific constraint definitions. These findings underscore the versatility and practicality of our method across a wide range of real-world tasks.
△ Less
Submitted 29 January, 2025;
originally announced January 2025.
-
Hermit Kingdom Through the Lens of Multiple Perspectives: A Case Study of LLM Hallucination on North Korea
Authors:
Eunjung Cho,
Won Ik Cho,
Soomin Seo
Abstract:
Hallucination in large language models (LLMs) remains a significant challenge for their safe deployment, particularly due to its potential to spread misinformation. Most existing solutions address this challenge by focusing on aligning the models with credible sources or by improving how models communicate their confidence (or lack thereof) in their outputs. While these measures may be effective i…
▽ More
Hallucination in large language models (LLMs) remains a significant challenge for their safe deployment, particularly due to its potential to spread misinformation. Most existing solutions address this challenge by focusing on aligning the models with credible sources or by improving how models communicate their confidence (or lack thereof) in their outputs. While these measures may be effective in most contexts, they may fall short in scenarios requiring more nuanced approaches, especially in situations where access to accurate data is limited or determining credible sources is challenging. In this study, we take North Korea - a country characterised by an extreme lack of reliable sources and the prevalence of sensationalist falsehoods - as a case study. We explore and evaluate how some of the best-performing multilingual LLMs and specific language-based models generate information about North Korea in three languages spoken in countries with significant geo-political interests: English (United States, United Kingdom), Korean (South Korea), and Mandarin Chinese (China). Our findings reveal significant differences, suggesting that the choice of model and language can lead to vastly different understandings of North Korea, which has important implications given the global security challenges the country poses.
△ Less
Submitted 10 January, 2025;
originally announced January 2025.
-
Re-evaluating Group Robustness via Adaptive Class-Specific Scaling
Authors:
Seonguk Seo,
Bohyung Han
Abstract:
Group distributionally robust optimization, which aims to improve robust accuracies -- worst-group and unbiased accuracies -- is a prominent algorithm used to mitigate spurious correlations and address dataset bias. Although existing approaches have reported improvements in robust accuracies, these gains often come at the cost of average accuracy due to inherent trade-offs. To control this trade-o…
▽ More
Group distributionally robust optimization, which aims to improve robust accuracies -- worst-group and unbiased accuracies -- is a prominent algorithm used to mitigate spurious correlations and address dataset bias. Although existing approaches have reported improvements in robust accuracies, these gains often come at the cost of average accuracy due to inherent trade-offs. To control this trade-off flexibly and efficiently, we propose a simple class-specific scaling strategy, directly applicable to existing debiasing algorithms with no additional training. We further develop an instance-wise adaptive scaling technique to alleviate this trade-off, even leading to improvements in both robust and average accuracies. Our approach reveals that a naïve ERM baseline matches or even outperforms the recent debiasing methods by simply adopting the class-specific scaling technique. Additionally, we introduce a novel unified metric that quantifies the trade-off between the two accuracies as a scalar value, allowing for a comprehensive evaluation of existing algorithms. By tackling the inherent trade-off and offering a performance landscape, our approach provides valuable insights into robust techniques beyond just robust accuracy. We validate the effectiveness of our framework through experiments across datasets in computer vision and natural language processing domains.
△ Less
Submitted 19 December, 2024;
originally announced December 2024.
-
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
Authors:
Taein Son,
Soo Won Seo,
Jisong Kim,
Seok Hwan Lee,
Jun Won Choi
Abstract:
Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently consist of diverse information sources such as audio, visual cues, and surrounding scene contexts. Leveraging this multi-modal information effectively for VAD poses a significant challenge, as the model must identify action-relevant cues with precision. In this study, we introduce a no…
▽ More
Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently consist of diverse information sources such as audio, visual cues, and surrounding scene contexts. Leveraging this multi-modal information effectively for VAD poses a significant challenge, as the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context sourced from large-capacity image captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene descriptive information, enabling adaptive integration of crucial features for recognizing each actor's actions. We have developed a Transformer-based architecture, the Actor-centric Multi-modal Fusion Network, specifically designed to capture the dynamic interactions among actors and their multi-modal contexts. Our evaluation on three prominent VAD benchmarks, including AVA, UCF101-24, and JHMDB51-21, demonstrates that incorporating multi-modal information significantly enhances performance, setting new state-of-the-art performances in the field.
△ Less
Submitted 3 February, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
-
Learning to Translate Noise for Robust Image Denoising
Authors:
Inju Ha,
Donghun Ryou,
Seonguk Seo,
Bohyung Han
Abstract:
Deep learning-based image denoising techniques often struggle with poor generalization performance to out-of-distribution real-world noise. To tackle this challenge, we propose a novel noise translation framework that performs denoising on an image with translated noise rather than directly denoising an original noisy image. Specifically, our approach translates complex, unknown real-world noise i…
▽ More
Deep learning-based image denoising techniques often struggle with poor generalization performance to out-of-distribution real-world noise. To tackle this challenge, we propose a novel noise translation framework that performs denoising on an image with translated noise rather than directly denoising an original noisy image. Specifically, our approach translates complex, unknown real-world noise into Gaussian noise, which is spatially uncorrelated and independent of image content, through a noise translation network. The translated noisy images are then processed by an image denoising network pretrained to effectively remove Gaussian noise, enabling robust and consistent denoising performance. We also design well-motivated loss functions and architectures for the noise translation network by leveraging the mathematical properties of Gaussian noise. Experimental results demonstrate that the proposed method substantially improves robustness and generalizability, outperforming state-of-the-art methods across diverse benchmarks. Visualized denoising results and the source code are available on our project page.
△ Less
Submitted 5 December, 2024;
originally announced December 2024.
-
Class Granularity: How richly does your knowledge graph represent the real world?
Authors:
Sumin Seo,
Heeseon Cheon,
Hyunho Kim
Abstract:
To effectively manage and utilize knowledge graphs, it is crucial to have metrics that can assess the quality of knowledge graphs from various perspectives. While there have been studies on knowledge graph quality metrics, there has been a lack of research on metrics that measure how richly ontologies, which form the backbone of knowledge graphs, are defined or the impact of richly defined ontolog…
▽ More
To effectively manage and utilize knowledge graphs, it is crucial to have metrics that can assess the quality of knowledge graphs from various perspectives. While there have been studies on knowledge graph quality metrics, there has been a lack of research on metrics that measure how richly ontologies, which form the backbone of knowledge graphs, are defined or the impact of richly defined ontologies. In this study, we propose a new metric called Class Granularity, which measures how well a knowledge graph is structured in terms of how finely classes with unique characteristics are defined. Furthermore, this research presents potential impact of Class Granularity in knowledge graph's on downstream tasks. In particular, we explore its influence on graph embedding and provide experimental results. Additionally, this research goes beyond traditional Linked Open Data comparison studies, which mainly focus on factors like scale and class distribution, by using Class Granularity to compare four different LOD sources.
△ Less
Submitted 10 November, 2024;
originally announced November 2024.
-
Multi-aspect Depression Severity Assessment via Inductive Dialogue System
Authors:
Chaebin Lee,
Seungyeon Seo,
Heejin Do,
Gary Geunbae Lee
Abstract:
With the advancement of chatbots and the growing demand for automatic depression detection, identifying depression in patient conversations has gained more attention. However, prior methods often assess depression in a binary way or only a single score without diverse feedback and lack focus on enhancing dialogue responses. In this paper, we present a novel task of multi-aspect depression severity…
▽ More
With the advancement of chatbots and the growing demand for automatic depression detection, identifying depression in patient conversations has gained more attention. However, prior methods often assess depression in a binary way or only a single score without diverse feedback and lack focus on enhancing dialogue responses. In this paper, we present a novel task of multi-aspect depression severity assessment via an inductive dialogue system (MaDSA), evaluating a patient's depression level on multiple criteria by incorporating an assessment-aided response generation. Further, we propose a foundational system for MaDSA, which induces psychological dialogue responses with an auxiliary emotion classification task within a hierarchical severity assessment structure. We synthesize the conversational dataset annotated with eight aspects of depression severity alongside emotion labels, proven robust via human evaluations. Experimental results show potential for our preliminary work on MaDSA.
△ Less
Submitted 29 October, 2024;
originally announced October 2024.
-
Generative Flows on Synthetic Pathway for Drug Design
Authors:
Seonghwan Seo,
Minsu Kim,
Tony Shen,
Martin Ester,
Jinkyoo Park,
Sungsoo Ahn,
Woo Youn Kim
Abstract:
Generative models in drug discovery have recently gained attention as efficient alternatives to brute-force virtual screening. However, most existing models do not account for synthesizability, limiting their practical use in real-world scenarios. In this paper, we propose RxnFlow, which sequentially assembles molecules using predefined molecular building blocks and chemical reaction templates to…
▽ More
Generative models in drug discovery have recently gained attention as efficient alternatives to brute-force virtual screening. However, most existing models do not account for synthesizability, limiting their practical use in real-world scenarios. In this paper, we propose RxnFlow, which sequentially assembles molecules using predefined molecular building blocks and chemical reaction templates to constrain the synthetic chemical pathway. We then train on this sequential generating process with the objective of generative flow networks (GFlowNets) to generate both highly rewarded and diverse molecules. To mitigate the large action space of synthetic pathways in GFlowNets, we implement a novel action space subsampling method. This enables RxnFlow to learn generative flows over extensive action spaces comprising combinations of 1.2 million building blocks and 71 reaction templates without significant computational overhead. Additionally, RxnFlow can employ modified or expanded action spaces for generation without retraining, allowing for the introduction of additional objectives or the incorporation of newly discovered building blocks. We experimentally demonstrate that RxnFlow outperforms existing reaction-based and fragment-based models in pocket-specific optimization across various target pockets. Furthermore, RxnFlow achieves state-of-the-art performance on CrossDocked2020 for pocket-conditional generation, with an average Vina score of -8.85 kcal/mol and 34.8% synthesizability.
△ Less
Submitted 6 March, 2025; v1 submitted 6 October, 2024;
originally announced October 2024.
-
Zero-Shot Multi-Hop Question Answering via Monte-Carlo Tree Search with Large Language Models
Authors:
Seongmin Lee,
Jaewook Shin,
Youngjin Ahn,
Seokin Seo,
Ohjoon Kwon,
Kee-Eung Kim
Abstract:
Recent advances in large language models (LLMs) have significantly impacted the domain of multi-hop question answering (MHQA), where systems are required to aggregate information and infer answers from disparate pieces of text. However, the autoregressive nature of LLMs inherently poses a challenge as errors may accumulate if mistakes are made in the intermediate reasoning steps. This paper introd…
▽ More
Recent advances in large language models (LLMs) have significantly impacted the domain of multi-hop question answering (MHQA), where systems are required to aggregate information and infer answers from disparate pieces of text. However, the autoregressive nature of LLMs inherently poses a challenge as errors may accumulate if mistakes are made in the intermediate reasoning steps. This paper introduces Monte-Carlo tree search for Zero-shot multi-hop Question Answering (MZQA), a framework based on Monte-Carlo tree search (MCTS) to identify optimal reasoning paths in MHQA tasks, mitigating the error propagation from sequential reasoning processes. Unlike previous works, we propose a zero-shot prompting method, which relies solely on instructions without the support of hand-crafted few-shot examples that typically require domain expertise. We also introduce a behavioral cloning approach (MZQA-BC) trained on self-generated MCTS inference trajectories, achieving an over 10-fold increase in reasoning speed with bare compromise in performance. The efficacy of our method is validated on standard benchmarks such as HotpotQA, 2WikiMultihopQA, and MuSiQue, demonstrating that it outperforms existing frameworks.
△ Less
Submitted 1 October, 2024; v1 submitted 28 September, 2024;
originally announced September 2024.
-
E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models
Authors:
Chan Kim,
Keonwoo Kim,
Mintaek Oh,
Hanbi Baek,
Jiyang Lee,
Donghwi Jung,
Soojin Woo,
Younkyung Woo,
John Tucker,
Roya Firoozi,
Seung-Woo Seo,
Mac Schwager,
Seong-Woo Kim
Abstract:
Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stocha…
▽ More
Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stochastic, initial plans based solely on LLMs' general knowledge may fail to achieve their objectives, unlike in static scenarios. To address this limitation, this study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent's real-world experiences, drawing inspiration from human emotional responses. The proposed methodology enables one-shot behavior adjustments by updating the E2Map based on the agent's experiences. Our evaluation in stochastic navigation environments, including both simulations and real-world scenarios, demonstrates that the proposed method significantly enhances performance in stochastic environments compared to existing LLM-based approaches. Code and supplementary materials are available at https://e2map.github.io/.
△ Less
Submitted 2 February, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
DiagESC: Dialogue Synthesis for Integrating Depression Diagnosis into Emotional Support Conversation
Authors:
Seungyeon Seo,
Gary Geunbae Lee
Abstract:
Dialogue systems for mental health care aim to provide appropriate support to individuals experiencing mental distress. While extensive research has been conducted to deliver adequate emotional support, existing studies cannot identify individuals who require professional medical intervention and cannot offer suitable guidance. We introduce the Diagnostic Emotional Support Conversation task for an…
▽ More
Dialogue systems for mental health care aim to provide appropriate support to individuals experiencing mental distress. While extensive research has been conducted to deliver adequate emotional support, existing studies cannot identify individuals who require professional medical intervention and cannot offer suitable guidance. We introduce the Diagnostic Emotional Support Conversation task for an advanced mental health management system. We develop the DESC dataset to assess depression symptoms while maintaining user experience by utilizing task-specific utterance generation prompts and a strict filtering algorithm. Evaluations by professional psychological counselors indicate that DESC has a superior ability to diagnose depression than existing data. Additionally, conversational quality evaluation reveals that DESC maintains fluent, consistent, and coherent dialogues.
△ Less
Submitted 12 August, 2024;
originally announced August 2024.
-
JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling
Authors:
Seok Hwan Lee,
Taein Son,
Soo Won Seo,
Jisong Kim,
Jun Won Choi
Abstract:
Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performan…
▽ More
Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performance of two-stage VAD methods has been limited as they depend solely on localized actor features to infer action semantics. In this study, we propose a new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS), which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention. JARViS employs a person detector to produce densely sampled actor features from a keyframe. Concurrently, it uses a video backbone to create spatio-temporal scene features from a video clip. Finally, the fine-grained interactions between actors and scenes are modeled through a Unified Action-Scene Context Transformer to directly output the final set of actions in parallel. Our experimental results demonstrate that JARViS outperforms existing methods by significant margins and achieves state-of-the-art performance on three popular VAD datasets, including AVA, UCF101-24, and JHMDB51-21.
△ Less
Submitted 17 September, 2024; v1 submitted 7 August, 2024;
originally announced August 2024.
-
Revisiting Machine Unlearning with Dimensional Alignment
Authors:
Seonguk Seo,
Dongwan Kim,
Bohyung Han
Abstract:
Machine unlearning, an emerging research topic focusing on compliance with data privacy regulations, enables trained models to remove the information learned from specific data. While many existing methods indirectly address this issue by intentionally injecting incorrect supervisions, they can drastically and unpredictably alter the decision boundaries and feature spaces, leading to training inst…
▽ More
Machine unlearning, an emerging research topic focusing on compliance with data privacy regulations, enables trained models to remove the information learned from specific data. While many existing methods indirectly address this issue by intentionally injecting incorrect supervisions, they can drastically and unpredictably alter the decision boundaries and feature spaces, leading to training instability and undesired side effects. To fundamentally approach this task, we first analyze the changes in latent feature spaces between original and retrained models, and observe that the feature representations of samples not involved in training are closely aligned with the feature manifolds of previously seen samples in training. Based on these findings, we introduce a novel evaluation metric for machine unlearning, coined dimensional alignment, which measures the alignment between the eigenspaces of the forget and retain set samples. We employ this metric as a regularizer loss to build a robust and stable unlearning framework, which is further enhanced by integrating a self-distillation loss and an alternating training scheme. Our framework effectively eliminates information from the forget set and preserves knowledge from the retain set. Lastly, we identify critical flaws in established evaluation metrics for machine unlearning, and introduce new evaluation tools that more accurately reflect the fundamental goals of machine unlearning.
△ Less
Submitted 20 December, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
SPIN: SE(3)-Invariant Physics Informed Network for Binding Affinity Prediction
Authors:
Seungyeon Choi,
Sangmin Seo,
Sanghyun Park
Abstract:
Accurate prediction of protein-ligand binding affinity is crucial for rapid and efficient drug development. Recently, the importance of predicting binding affinity has led to increased attention on research that models the three-dimensional structure of protein-ligand complexes using graph neural networks to predict binding affinity. However, traditional methods often fail to accurately model the…
▽ More
Accurate prediction of protein-ligand binding affinity is crucial for rapid and efficient drug development. Recently, the importance of predicting binding affinity has led to increased attention on research that models the three-dimensional structure of protein-ligand complexes using graph neural networks to predict binding affinity. However, traditional methods often fail to accurately model the complex's spatial information or rely solely on geometric features, neglecting the principles of protein-ligand binding. This can lead to overfitting, resulting in models that perform poorly on independent datasets and ultimately reducing their usefulness in real drug development. To address this issue, we propose SPIN, a model designed to achieve superior generalization by incorporating various inductive biases applicable to this task, beyond merely training on empirical data from datasets. For prediction, we defined two types of inductive biases: a geometric perspective that maintains consistent binding affinity predictions regardless of the complexs rotations and translations, and a physicochemical perspective that necessitates minimal binding free energy along their reaction coordinate for effective protein-ligand binding. These prior knowledge inputs enable the SPIN to outperform comparative models in benchmark sets such as CASF-2016 and CSAR HiQ. Furthermore, we demonstrated the practicality of our model through virtual screening experiments and validated the reliability and potential of our proposed model based on experiments assessing its interpretability.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Self-Explainable Temporal Graph Networks based on Graph Information Bottleneck
Authors:
Sangwoo Seo,
Sungwon Kim,
Jihyeong Jung,
Yoonho Lee,
Chanyoung Park
Abstract:
Temporal Graph Neural Networks (TGNN) have the ability to capture both the graph topology and dynamic dependencies of interactions within a graph over time. There has been a growing need to explain the predictions of TGNN models due to the difficulty in identifying how past events influence their predictions. Since the explanation model for a static graph cannot be readily applied to temporal grap…
▽ More
Temporal Graph Neural Networks (TGNN) have the ability to capture both the graph topology and dynamic dependencies of interactions within a graph over time. There has been a growing need to explain the predictions of TGNN models due to the difficulty in identifying how past events influence their predictions. Since the explanation model for a static graph cannot be readily applied to temporal graphs due to its inability to capture temporal dependencies, recent studies proposed explanation models for temporal graphs. However, existing explanation models for temporal graphs rely on post-hoc explanations, requiring separate models for prediction and explanation, which is limited in two aspects: efficiency and accuracy of explanation. In this work, we propose a novel built-in explanation framework for temporal graphs, called Self-Explainable Temporal Graph Networks based on Graph Information Bottleneck (TGIB). TGIB provides explanations for event occurrences by introducing stochasticity in each temporal event based on the Information Bottleneck theory. Experimental results demonstrate the superiority of TGIB in terms of both the link prediction performance and explainability compared to state-of-the-art methods. This is the first work that simultaneously performs prediction and explanation for temporal graphs in an end-to-end manner.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Authors:
Bahare Fatemi,
Mehran Kazemi,
Anton Tsitsulin,
Karishma Malkan,
Jinyeong Yim,
John Palowitch,
Sungyong Seo,
Jonathan Halcrow,
Bryan Perozzi
Abstract:
Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-trai…
▽ More
Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster further research in this area, we are open-sourcing the datasets and evaluation framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
REVECA: Adaptive Planning and Trajectory-based Validation in Cooperative Language Agents using Information Relevance and Relative Proximity
Authors:
SeungWon Seo,
SeongRae Noh,
Junhyeok Lee,
SooBin Lim,
Won Hee Lee,
HyeongYeop Kang
Abstract:
We address the challenge of multi-agent cooperation, where agents achieve a common goal by cooperating with decentralized agents under complex partial observations. Existing cooperative agent systems often struggle with efficiently processing continuously accumulating information, managing globally suboptimal planning due to lack of consideration of collaborators, and addressing false planning cau…
▽ More
We address the challenge of multi-agent cooperation, where agents achieve a common goal by cooperating with decentralized agents under complex partial observations. Existing cooperative agent systems often struggle with efficiently processing continuously accumulating information, managing globally suboptimal planning due to lack of consideration of collaborators, and addressing false planning caused by environmental changes introduced by other collaborators. To overcome these challenges, we propose the RElevance, Proximity, and Validation-Enhanced Cooperative Language Agent (REVECA), a novel cognitive architecture powered by GPT-4o-mini. REVECA enables efficient memory management, optimal planning, and cost-effective prevention of false planning by leveraging Relevance Estimation, Adaptive Planning, and Trajectory-based Validation. Extensive experimental results demonstrate REVECA's superiority over existing methods across various benchmarks, while a user study reveals its potential for achieving trustworthy human-AI cooperation.
△ Less
Submitted 18 December, 2024; v1 submitted 26 May, 2024;
originally announced May 2024.
-
Autonomous Algorithm for Training Autonomous Vehicles with Minimal Human Intervention
Authors:
Sang-Hyun Lee,
Daehyeok Kwon,
Seung-Woo Seo
Abstract:
Recent reinforcement learning (RL) algorithms have demonstrated impressive results in simulated driving environments. However, autonomous vehicles trained in simulation often struggle to work well in the real world due to the fidelity gap between simulated and real-world environments. While directly training real-world autonomous vehicles with RL algorithms is a promising approach to bypass the fi…
▽ More
Recent reinforcement learning (RL) algorithms have demonstrated impressive results in simulated driving environments. However, autonomous vehicles trained in simulation often struggle to work well in the real world due to the fidelity gap between simulated and real-world environments. While directly training real-world autonomous vehicles with RL algorithms is a promising approach to bypass the fidelity gap problem, it presents several challenges. One critical yet often overlooked challenge is the need to reset a driving environment between every episode. This reset process demands significant human intervention, leading to poor training efficiency in the real world. In this paper, we introduce a novel autonomous algorithm that enables off-the-shelf RL algorithms to train autonomous vehicles with minimal human intervention. Our algorithm reduces unnecessary human intervention by aborting episodes to prevent unsafe states and identifying informative initial states for subsequent episodes. The key idea behind identifying informative initial states is to estimate the expected amount of information that can be obtained from under-explored but reachable states. Our algorithm also revisits rule-based autonomous driving algorithms and highlights their benefits in safely returning an autonomous vehicle to initial states. To evaluate how much human intervention is required during training, we implement challenging urban driving tasks that require an autonomous vehicle to reset to initial states on its own. The experimental results show that our autonomous algorithm is task-agnostic and achieves competitive driving performance with much less human intervention than baselines.
△ Less
Submitted 15 January, 2025; v1 submitted 22 May, 2024;
originally announced May 2024.
-
IDIL: Imitation Learning of Intent-Driven Expert Behavior
Authors:
Sangwon Seo,
Vaibhav Unhelkar
Abstract:
When faced with accomplishing a task, human experts exhibit intentional behavior. Their unique intents shape their plans and decisions, resulting in experts demonstrating diverse behaviors to accomplish the same task. Due to the uncertainties encountered in the real world and their bounded rationality, experts sometimes adjust their intents, which in turn influences their behaviors during task exe…
▽ More
When faced with accomplishing a task, human experts exhibit intentional behavior. Their unique intents shape their plans and decisions, resulting in experts demonstrating diverse behaviors to accomplish the same task. Due to the uncertainties encountered in the real world and their bounded rationality, experts sometimes adjust their intents, which in turn influences their behaviors during task execution. This paper introduces IDIL, a novel imitation learning algorithm to mimic these diverse intent-driven behaviors of experts. Iteratively, our approach estimates expert intent from heterogeneous demonstrations and then uses it to learn an intent-aware model of their behavior. Unlike contemporary approaches, IDIL is capable of addressing sequential tasks with high-dimensional state representations, while sidestepping the complexities and drawbacks associated with adversarial training (a mainstay of related techniques). Our empirical results suggest that the models generated by IDIL either match or surpass those produced by recent imitation learning benchmarks in metrics of task performance. Moreover, as it creates a generative model, IDIL demonstrates superior performance in intent inference metrics, crucial for human-agent interactions, and aptly captures a broad spectrum of expert behaviors.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Traversability-aware Adaptive Optimization for Path Planning and Control in Mountainous Terrain
Authors:
Se-Wook Yoo,
E In Son,
Seung-Woo Seo
Abstract:
Autonomous navigation in extreme mountainous terrains poses challenges due to the presence of mobility-stressing elements and undulating surfaces, making it particularly difficult compared to conventional off-road driving scenarios. In such environments, estimating traversability solely based on exteroceptive sensors often leads to the inability to reach the goal due to a high prevalence of non-tr…
▽ More
Autonomous navigation in extreme mountainous terrains poses challenges due to the presence of mobility-stressing elements and undulating surfaces, making it particularly difficult compared to conventional off-road driving scenarios. In such environments, estimating traversability solely based on exteroceptive sensors often leads to the inability to reach the goal due to a high prevalence of non-traversable areas. In this paper, we consider traversability as a relative value that integrates the robot's internal state, such as speed and torque to exhibit resilient behavior to reach its goal successfully. We separate traversability into apparent traversability and relative traversability, then incorporate these distinctions in the optimization process of sampling-based planning and motion predictive control. Our method enables the robots to execute the desired behaviors more accurately while avoiding hazardous regions and getting stuck. Experiments conducted on simulation with 27 diverse types of mountainous terrain and real-world demonstrate the robustness of the proposed framework, with increasingly better performance observed in more complex environments.
△ Less
Submitted 4 April, 2024;
originally announced April 2024.
-
HyperCLOVA X Technical Report
Authors:
Kang Min Yoo,
Jaegeun Han,
Sookyo In,
Heewon Jeon,
Jisu Jeong,
Jaewook Kang,
Hyunwook Kim,
Kyung-Min Kim,
Munhyong Kim,
Sungju Kim,
Donghyun Kwak,
Hanock Kwak,
Se Jung Kwon,
Bado Lee,
Dongsoo Lee,
Gichang Lee,
Jooho Lee,
Baeseong Park,
Seongjin Shin,
Joonsang Yu,
Seolki Baek,
Sumin Byeon,
Eungsup Cho,
Dooseok Choe,
Jeesung Han
, et al. (371 additional authors not shown)
Abstract:
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…
▽ More
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
△ Less
Submitted 13 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
SCANNER: Knowledge-Enhanced Approach for Robust Multi-modal Named Entity Recognition of Unseen Entities
Authors:
Hyunjong Ok,
Taeho Kil,
Sukmin Seo,
Jaeho Lee
Abstract:
Recent advances in named entity recognition (NER) have pushed the boundary of the task to incorporate visual signals, leading to many variants, including multi-modal NER (MNER) or grounded MNER (GMNER). A key challenge to these tasks is that the model should be able to generalize to the entities unseen during the training, and should be able to handle the training samples with noisy annotations. T…
▽ More
Recent advances in named entity recognition (NER) have pushed the boundary of the task to incorporate visual signals, leading to many variants, including multi-modal NER (MNER) or grounded MNER (GMNER). A key challenge to these tasks is that the model should be able to generalize to the entities unseen during the training, and should be able to handle the training samples with noisy annotations. To address this obstacle, we propose SCANNER (Span CANdidate detection and recognition for NER), a model capable of effectively handling all three NER variants. SCANNER is a two-stage structure; we extract entity candidates in the first stage and use it as a query to get knowledge, effectively pulling knowledge from various sources. We can boost our performance by utilizing this entity-centric extracted knowledge to address unseen entities. Furthermore, to tackle the challenges arising from noisy annotations in NER datasets, we introduce a novel self-distillation method, enhancing the robustness and accuracy of our model in processing training data with inherent uncertainties. Our approach demonstrates competitive performance on the NER benchmark and surpasses existing methods on both MNER and GMNER benchmarks. Further analysis shows that the proposed distillation and knowledge utilization methods improve the performance of our model on various benchmarks.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Unleash the Potential of CLIP for Video Highlight Detection
Authors:
Donghoon Han,
Seunghyeon Seo,
Eunhwan Park,
Seong-Uk Nam,
Nojun Kwak
Abstract:
Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking novel potentials across various tasks and applications. Among these domains, the video domain has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel in the video highlight detection task by leveraging the pre-train…
▽ More
Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking novel potentials across various tasks and applications. Among these domains, the video domain has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel in the video highlight detection task by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our innovative saliency pooling technique, we have achieved the state-of-the-art performance in the highlight detection task, the QVHighlight Benchmark, to the best of our knowledge.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Make VLM Recognize Visual Hallucination on Cartoon Character Image with Pose Information
Authors:
Bumsoo Kim,
Wonseop Shin,
Kyuchul Lee,
Yonghoon Jung,
Sanghyun Seo
Abstract:
Leveraging large-scale Text-to-Image (TTI) models have become a common technique for generating exemplar or training dataset in the fields of image synthesis, video editing, 3D reconstruction. However, semantic structural visual hallucinations involving perceptually severe defects remain a concern, especially in the domain of non-photorealistic rendering (NPR) such as cartoons and pixelization-sty…
▽ More
Leveraging large-scale Text-to-Image (TTI) models have become a common technique for generating exemplar or training dataset in the fields of image synthesis, video editing, 3D reconstruction. However, semantic structural visual hallucinations involving perceptually severe defects remain a concern, especially in the domain of non-photorealistic rendering (NPR) such as cartoons and pixelization-style character. To detect these hallucinations in NPR, We propose a novel semantic structural hallucination detection system using Vision-Language Model (VLM). Our approach is to leverage the emerging capability of large language model, in-context learning which denotes that VLM has seen some examples by user for specific downstream task, here hallucination detection. Based on in-context learning, we introduce pose-aware in-context visual learning (PA-ICVL) which improve the overall performance of VLM by further inputting visual data beyond prompts, RGB images and pose information. By incorporating pose guidance, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. Within selected two VLMs, GPT-4v, Gemini pro vision, our proposed PA-ICVL improves the hallucination detection with 50% to 78%, 57% to 80%, respectively. This research advances a capability of TTI models toward real-world applications by mitigating visual hallucinations via in-context visual learning, expanding their potential in non-photorealistic domains. In addition, it showcase how users can boost the downstream-specialized capability of open VLM by harnessing additional conditions. We collect synthetic cartoon-hallucination dataset with TTI models, this dataset and final tuned VLM will be publicly available.
△ Less
Submitted 22 January, 2025; v1 submitted 22 March, 2024;
originally announced March 2024.
-
ARC-NeRF: Area Ray Casting for Broader Unseen View Coverage in Few-shot Object Rendering
Authors:
Seunghyeon Seo,
Yeonjin Chang,
Jayeon Yoo,
Seungwoo Lee,
Hojun Lee,
Nojun Kwak
Abstract:
Recent advancements in the Neural Radiance Field (NeRF) have enhanced its capabilities for novel view synthesis, yet its reliance on dense multi-view training images poses a practical challenge, often leading to artifacts and a lack of fine object details. Addressing this, we propose ARC-NeRF, an effective regularization-based approach with a novel Area Ray Casting strategy. While the previous ray…
▽ More
Recent advancements in the Neural Radiance Field (NeRF) have enhanced its capabilities for novel view synthesis, yet its reliance on dense multi-view training images poses a practical challenge, often leading to artifacts and a lack of fine object details. Addressing this, we propose ARC-NeRF, an effective regularization-based approach with a novel Area Ray Casting strategy. While the previous ray augmentation methods are limited to covering only a single unseen view per extra ray, our proposed Area Ray covers a broader range of unseen views with just a single ray and enables an adaptive high-frequency regularization based on target pixel photo-consistency. Moreover, we propose luminance consistency regularization, which enhances the consistency of relative luminance between the original and Area Ray, leading to more accurate object textures. The relative luminance, as a free lunch extra data easily derived from RGB images, can be effectively utilized in few-shot scenarios where available training data is limited. Our ARC-NeRF outperforms its baseline and achieves competitive results on multiple benchmarks with sharply rendered fine details.
△ Less
Submitted 7 April, 2025; v1 submitted 16 March, 2024;
originally announced March 2024.
-
Never Tell the Trick: Covert Interactive Mixed Reality System for Immersive Storytelling
Authors:
Chanwoo Lee,
Kyubeom Shim,
Sanggyo Seo,
Gwonu Ryu,
Yongsoon Choi
Abstract:
This study explores the integration of Ultra-Wideband (UWB) technology into Mixed Reality (MR) Systems for immersive storytelling. Addressing the limitations of existing technologies like Microsoft Kinect and HTC Vive, the research focuses on overcoming challenges in robustness to occlusion, tracking volume, and cost efficiency in props tracking. Utilizing UWB technology, the interactive MR system…
▽ More
This study explores the integration of Ultra-Wideband (UWB) technology into Mixed Reality (MR) Systems for immersive storytelling. Addressing the limitations of existing technologies like Microsoft Kinect and HTC Vive, the research focuses on overcoming challenges in robustness to occlusion, tracking volume, and cost efficiency in props tracking. Utilizing UWB technology, the interactive MR system enhances the scope of performance art by enabling larger tracking areas, more reliable and cheaper multi-prop tracking, and reducing occlusion issues. Preliminary user tests suggest meaningful improvements in immersive experience, promising a new possibility in Extended Reality (XR) theater, performance art and immersive game.
△ Less
Submitted 3 March, 2024;
originally announced March 2024.
-
Results and Lessons Learned from Autonomous Driving Transportation Services in Airfield, Crowded Indoor, and Urban Environments
Authors:
Doosan Baek,
Sanghyun Kim,
Seung-Woo Seo,
Sang-Hyun Lee
Abstract:
Autonomous vehicles have been actively investigated over the past few decades. Several recent works show the potential of autonomous vehicles in urban environments with impressive experimental results. However, these works note that autonomous vehicles are still occasionally inferior to expert drivers in complex scenarios. Furthermore, they do not focus on the possibilities of autonomous driving t…
▽ More
Autonomous vehicles have been actively investigated over the past few decades. Several recent works show the potential of autonomous vehicles in urban environments with impressive experimental results. However, these works note that autonomous vehicles are still occasionally inferior to expert drivers in complex scenarios. Furthermore, they do not focus on the possibilities of autonomous driving transportation services in other areas beyond urban environments. This paper presents the research results and lessons learned from autonomous driving transportation services in airfield, crowded indoor, and urban environments. We discuss how we address several unique challenges in these diverse environments. We also offer an overview of remaining challenges that have not received much attention but must be addressed. This paper aims to share our unique experience to support researchers who are interested in exploring autonomous driving transportation services in various real-world environments.
△ Less
Submitted 20 March, 2024; v1 submitted 2 March, 2024;
originally announced March 2024.
-
StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention
Authors:
Seungwon Seo,
Suho Lee,
Sangheum Hwang
Abstract:
Utilizing large-scale pretrained models is a well-known strategy to enhance performance on various target tasks. It is typically achieved through fine-tuning pretrained models on target tasks. However, naïve fine-tuning may not fully leverage knowledge embedded in pretrained models. In this study, we introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Tra…
▽ More
Utilizing large-scale pretrained models is a well-known strategy to enhance performance on various target tasks. It is typically achieved through fine-tuning pretrained models on target tasks. However, naïve fine-tuning may not fully leverage knowledge embedded in pretrained models. In this study, we introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Transformer architectures. This method modifies the Transformer's self-attention mechanism to selectively utilize knowledge from pretrained models during fine-tuning. Specifically, in each block, instead of self-attention, cross-attention is performed stochastically according to the predefined probability, where keys and values are extracted from the corresponding block of a pretrained model. By doing so, queries and channel-mixing multi-layer perceptron layers of a target model are fine-tuned to target tasks to learn how to effectively exploit rich representations of pretrained models. To verify the effectiveness of StochCA, extensive experiments are conducted on benchmarks in the areas of transfer learning and domain generalization, where the exploitation of pretrained models is critical. Our experimental results show the superiority of StochCA over state-of-the-art approaches in both areas. Furthermore, we demonstrate that StochCA is complementary to existing approaches, i.e., it can be combined with them to further improve performance. Our code is available at https://github.com/daintlab/stochastic_cross_attention
△ Less
Submitted 28 September, 2024; v1 submitted 25 February, 2024;
originally announced February 2024.
-
Follow the Footprints: Self-supervised Traversability Estimation for Off-road Vehicle Navigation based on Geometric and Visual Cues
Authors:
Yurim Jeon,
E In Son,
Seung-Woo Seo
Abstract:
In this study, we address the off-road traversability estimation problem, that predicts areas where a robot can navigate in off-road environments. An off-road environment is an unstructured environment comprising a combination of traversable and non-traversable spaces, which presents a challenge for estimating traversability. This study highlights three primary factors that affect a robot's traver…
▽ More
In this study, we address the off-road traversability estimation problem, that predicts areas where a robot can navigate in off-road environments. An off-road environment is an unstructured environment comprising a combination of traversable and non-traversable spaces, which presents a challenge for estimating traversability. This study highlights three primary factors that affect a robot's traversability in an off-road environment: surface slope, semantic information, and robot platform. We present two strategies for estimating traversability, using a guide filter network (GFN) and footprint supervision module (FSM). The first strategy involves building a novel GFN using a newly designed guide filter layer. The GFN interprets the surface and semantic information from the input data and integrates them to extract features optimized for traversability estimation. The second strategy involves developing an FSM, which is a self-supervision module that utilizes the path traversed by the robot in pre-driving, also known as a footprint. This enables the prediction of traversability that reflects the characteristics of the robot platform. Based on these two strategies, the proposed method overcomes the limitations of existing methods, which require laborious human supervision and lack scalability. Extensive experiments in diverse conditions, including automobiles and unmanned ground vehicles, herbfields, woodlands, and farmlands, demonstrate that the proposed method is compatible for various robot platforms and adaptable to a range of terrains. Code is available at https://github.com/yurimjeon1892/FtFoot.
△ Less
Submitted 23 February, 2024;
originally announced February 2024.
-
Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Authors:
Heeseung Kim,
Soonshin Seo,
Kyeongseok Jeong,
Ohsung Kwon,
Soyoon Kim,
Jungwhan Kim,
Jaehong Lee,
Eunwoo Song,
Myungwoo Oh,
Jung-Woo Ha,
Sungroh Yoon,
Kang Min Yoo
Abstract:
Recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken respons…
▽ More
Recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We have verified the inclusion of prosody in speech tokens that predominantly contain semantic information and have used this foundation to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics. To construct USDM, we fine-tune our speech-text model on spoken dialog data using a multi-step spoken dialog template that stimulates the chain-of-reasoning capabilities exhibited by the underlying LLM. Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines. Our code and checkpoints are available at https://github.com/naver-ai/usdm.
△ Less
Submitted 27 November, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Minecraft-ify: Minecraft Style Image Generation with Text-guided Image Editing for In-Game Application
Authors:
Bumsoo Kim,
Sanghyun Byun,
Yonghoon Jung,
Wonseop Shin,
Sareer UI Amin,
Sanghyun Seo
Abstract:
In this paper, we first present the character texture generation system \textit{Minecraft-ify}, specified to Minecraft video game toward in-game application. Ours can generate face-focused image for texture mapping tailored to 3D virtual character having cube manifold. While existing projects or works only generate texture, proposed system can inverse the user-provided real image, or generate aver…
▽ More
In this paper, we first present the character texture generation system \textit{Minecraft-ify}, specified to Minecraft video game toward in-game application. Ours can generate face-focused image for texture mapping tailored to 3D virtual character having cube manifold. While existing projects or works only generate texture, proposed system can inverse the user-provided real image, or generate average/random appearance from learned distribution. Moreover, it can be manipulated with text-guidance using StyleGAN and StyleCLIP. These features provide a more extended user experience with enlarged freedom as a user-friendly AI-tool. Project page can be found at https://gh-bumsookim.github.io/Minecraft-ify/
△ Less
Submitted 3 March, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
ToonAging: Face Re-Aging upon Artistic Portrait Style Transfer
Authors:
Bumsoo Kim,
Abdul Muqeet,
Kyuchul Lee,
Sanghyun Seo
Abstract:
Face re-aging is a prominent field in computer vision and graphics, with significant applications in photorealistic domains such as movies, advertising, and live streaming. Recently, the need to apply face re-aging to non-photorealistic images, like comics, illustrations, and animations, has emerged as an extension in various entertainment sectors. However, the lack of a network that can seamlessl…
▽ More
Face re-aging is a prominent field in computer vision and graphics, with significant applications in photorealistic domains such as movies, advertising, and live streaming. Recently, the need to apply face re-aging to non-photorealistic images, like comics, illustrations, and animations, has emerged as an extension in various entertainment sectors. However, the lack of a network that can seamlessly edit the apparent age in NPR images has limited these tasks to a naive, sequential approach. This often results in unpleasant artifacts and a loss of facial attributes due to domain discrepancies. In this paper, we introduce a novel one-stage method for face re-aging combined with portrait style transfer, executed in a single generative step. We leverage existing face re-aging and style transfer networks, both trained within the same PR domain. Our method uniquely fuses distinct latent vectors, each responsible for managing aging-related attributes and NPR appearance. By adopting an exemplar-based approach, our method offers greater flexibility compared to domain-level fine-tuning approaches, which typically require separate training or fine-tuning for each domain. This effectively addresses the limitation of requiring paired datasets for re-aging and domain-level, data-driven approaches for stylization. Our experiments show that our model can effortlessly generate re-aged images while simultaneously transferring the style of examples, maintaining both natural appearance and controllability.
△ Less
Submitted 28 May, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Knowledge Distillation from Language-Oriented to Emergent Communication for Multi-Agent Remote Control
Authors:
Yongjun Kim,
Sejin Seo,
Jihong Park,
Mehdi Bennis,
Seong-Lyun Kim,
Junil Choi
Abstract:
In this work, we compare emergent communication (EC) built upon multi-agent deep reinforcement learning (MADRL) and language-oriented semantic communication (LSC) empowered by a pre-trained large language model (LLM) using human language. In a multi-agent remote navigation task, with multimodal input data comprising location and channel maps, it is shown that EC incurs high training cost and strug…
▽ More
In this work, we compare emergent communication (EC) built upon multi-agent deep reinforcement learning (MADRL) and language-oriented semantic communication (LSC) empowered by a pre-trained large language model (LLM) using human language. In a multi-agent remote navigation task, with multimodal input data comprising location and channel maps, it is shown that EC incurs high training cost and struggles when using multimodal data, whereas LSC yields high inference computing cost due to the LLM's large size. To address their respective bottlenecks, we propose a novel framework of language-guided EC (LEC) by guiding the EC training using LSC via knowledge distillation (KD). Simulations corroborate that LEC achieves faster travel time while avoiding areas with poor channel conditions, as well as speeding up the MADRL training convergence by up to 61.8% compared to EC.
△ Less
Submitted 3 March, 2024; v1 submitted 23 January, 2024;
originally announced January 2024.
-
Relaxed Contrastive Learning for Federated Learning
Authors:
Seonguk Seo,
Jinkyu Kim,
Geeho Kim,
Bohyung Han
Abstract:
We propose a novel contrastive learning framework to effectively address the challenges of data heterogeneity in federated learning. We first analyze the inconsistency of gradient updates across clients during local training and establish its dependence on the distribution of feature representations, leading to the derivation of the supervised contrastive learning (SCL) objective to mitigate local…
▽ More
We propose a novel contrastive learning framework to effectively address the challenges of data heterogeneity in federated learning. We first analyze the inconsistency of gradient updates across clients during local training and establish its dependence on the distribution of feature representations, leading to the derivation of the supervised contrastive learning (SCL) objective to mitigate local deviations. In addition, we show that a naïve adoption of SCL in federated learning leads to representation collapse, resulting in slow convergence and limited performance gains. To address this issue, we introduce a relaxed contrastive learning loss that imposes a divergence penalty on excessively similar sample pairs within each class. This strategy prevents collapsed representations and enhances feature transferability, facilitating collaborative training and leading to significant performance improvements. Our framework outperforms all existing federated learning approaches by huge margins on the standard benchmarks through extensive experimental results.
△ Less
Submitted 31 May, 2024; v1 submitted 9 January, 2024;
originally announced January 2024.
-
Interpreting Adaptive Gradient Methods by Parameter Scaling for Learning-Rate-Free Optimization
Authors:
Min-Kook Suh,
Seung-Woo Seo
Abstract:
We address the challenge of estimating the learning rate for adaptive gradient methods used in training deep neural networks. While several learning-rate-free approaches have been proposed, they are typically tailored for steepest descent. However, although steepest descent methods offer an intuitive approach to finding minima, many deep learning applications require adaptive gradient methods to a…
▽ More
We address the challenge of estimating the learning rate for adaptive gradient methods used in training deep neural networks. While several learning-rate-free approaches have been proposed, they are typically tailored for steepest descent. However, although steepest descent methods offer an intuitive approach to finding minima, many deep learning applications require adaptive gradient methods to achieve faster convergence. In this paper, we interpret adaptive gradient methods as steepest descent applied on parameter-scaled networks, proposing learning-rate-free adaptive gradient methods. Experimental results verify the effectiveness of this approach, demonstrating comparable performance to hand-tuned learning rates across various scenarios. This work extends the applicability of learning-rate-free methods, enhancing training with adaptive gradient methods.
△ Less
Submitted 6 January, 2024;
originally announced January 2024.