-
Code Execution as Grounded Supervision for LLM Reasoning
Authors:
Dongwon Jung,
Wenxuan Zhou,
Muhao Chen
Abstract:
Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dat…
▽ More
Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into a natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting
Authors:
Jaehoon Choi,
Dongki Jung,
Christopher Maxey,
Yonghan Lee,
Sungmin Eum,
Dinesh Manocha,
Heesung Kwon
Abstract:
Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic render…
▽ More
Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10~50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
Locality-Aware Zero-Shot Human-Object Interaction Detection
Authors:
Sanghyun Kim,
Deunsol Jung,
Minsu Cho
Abstract:
Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results on various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook fine-grained information necessary for…
▽ More
Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results on various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook fine-grained information necessary for distinguishing interactions. To address this issue, we devise, LAIN, a novel zero-shot HOI detection framework enhancing the locality and interaction awareness of CLIP representations. The locality awareness, which involves capturing fine-grained details and the spatial structure of individual objects, is achieved by aggregating the information and spatial priors of adjacent neighborhood patches. The interaction awareness, which involves identifying whether and how a human is interacting with an object, is achieved by capturing the interaction pattern between the human and the object. By infusing locality and interaction awareness into CLIP representation, LAIN captures detailed information about the human-object pairs. Our extensive experiments on existing benchmarks show that LAIN outperforms previous methods on various zero-shot settings, demonstrating the importance of locality and interaction awareness for effective zero-shot HOI detection.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents
Authors:
Minsoo Khang,
Sangjun Park,
Teakgyu Hong,
Dawoon Jung
Abstract:
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for ad…
▽ More
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: https://github.com/UpstageAI/CReSt.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Learning Dense Hand Contact Estimation from Imbalanced Data
Authors:
Daniel Sungho Jung,
Kyoung Mu Lee
Abstract:
Hands are essential to human interaction, and understanding contact between hands and the world can promote comprehensive understanding of their function. Recently, there have been growing number of hand interaction datasets that cover interaction with object, other hand, scene, and body. Despite the significance of the task and increasing high-quality data, how to effectively learn dense hand con…
▽ More
Hands are essential to human interaction, and understanding contact between hands and the world can promote comprehensive understanding of their function. Recently, there have been growing number of hand interaction datasets that cover interaction with object, other hand, scene, and body. Despite the significance of the task and increasing high-quality data, how to effectively learn dense hand contact estimation remains largely underexplored. There are two major challenges for learning dense hand contact estimation. First, there exists class imbalance issue from hand contact datasets where majority of samples are not in contact. Second, hand contact datasets contain spatial imbalance issue with most of hand contact exhibited in finger tips, resulting in challenges for generalization towards contacts in other hand regions. To tackle these issues, we present a framework that learns dense HAnd COntact estimation (HACO) from imbalanced data. To resolve the class imbalance issue, we introduce balanced contact sampling, which builds and samples from multiple sampling groups that fairly represent diverse contact statistics for both contact and non-contact samples. Moreover, to address the spatial imbalance issue, we propose vertex-level class-balanced (VCB) loss, which incorporates spatially varying contact distribution by separately reweighting loss contribution of each vertex based on its contact frequency across dataset. As a result, we effectively learn to predict dense hand contact estimation with large-scale hand contact data without suffering from class and spatial imbalance issue. The codes will be released.
△ Less
Submitted 16 May, 2025;
originally announced May 2025.
-
Domain Adversarial Training for Mitigating Gender Bias in Speech-based Mental Health Detection
Authors:
June-Woo Kim,
Haram Yoon,
Wonkyo Oh,
Dawoon Jung,
Sung-Hoon Yoon,
Dae-Jin Kim,
Dong-Ho Lee,
Sang-Yeol Lee,
Chan-Mo Yang
Abstract:
Speech-based AI models are emerging as powerful tools for detecting depression and the presence of Post-traumatic stress disorder (PTSD), offering a non-invasive and cost-effective way to assess mental health. However, these models often struggle with gender bias, which can lead to unfair and inaccurate predictions. In this study, our study addresses this issue by introducing a domain adversarial…
▽ More
Speech-based AI models are emerging as powerful tools for detecting depression and the presence of Post-traumatic stress disorder (PTSD), offering a non-invasive and cost-effective way to assess mental health. However, these models often struggle with gender bias, which can lead to unfair and inaccurate predictions. In this study, our study addresses this issue by introducing a domain adversarial training approach that explicitly considers gender differences in speech-based depression and PTSD detection. Specifically, we treat different genders as distinct domains and integrate this information into a pretrained speech foundation model. We then validate its effectiveness on the E-DAIC dataset to assess its impact on performance. Experimental results show that our method notably improves detection performance, increasing the F1-score by up to 13.29 percentage points compared to the baseline. This highlights the importance of addressing demographic disparities in AI-driven mental health assessment.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting
Authors:
Jaehoon Choi,
Dongki Jung,
Yonghan Lee,
Sungmin Eum,
Dinesh Manocha,
Heesung Kwon
Abstract:
We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating…
▽ More
We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations-both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3D Gaussian Splatting. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
Investigating Large Language Models in Diagnosing Students' Cognitive Skills in Math Problem-solving
Authors:
Hyoungwook Jin,
Yoonsu Kim,
Dongyun Jung,
Seungju Kim,
Kiyoon Choi,
Jinho Son,
Juho Kim
Abstract:
Mathematics learning entails mastery of both content knowledge and cognitive processing of knowing, applying, and reasoning with it. Automated math assessment primarily has focused on grading students' exhibition of content knowledge by finding textual evidence, such as specific numbers, formulas, and statements. Recent advancements in problem-solving, image recognition, and reasoning capabilities…
▽ More
Mathematics learning entails mastery of both content knowledge and cognitive processing of knowing, applying, and reasoning with it. Automated math assessment primarily has focused on grading students' exhibition of content knowledge by finding textual evidence, such as specific numbers, formulas, and statements. Recent advancements in problem-solving, image recognition, and reasoning capabilities of large language models (LLMs) show promise for nuanced evaluation of students' cognitive skills. Diagnosing cognitive skills needs to infer students' thinking processes beyond textual evidence, which is an underexplored task in LLM-based automated assessment. In this work, we investigate how state-of-the-art LLMs diagnose students' cognitive skills in mathematics. We constructed MathCog, a novel benchmark dataset comprising 639 student responses to 110 expert-curated middle school math problems, each annotated with detailed teachers' diagnoses based on cognitive skill checklists. Using MathCog, we evaluated 16 closed and open LLMs of varying model sizes and vendors. Our evaluation reveals that even the state-of-the-art LLMs struggle with the task, all F1 scores below 0.5, and tend to exhibit strong false confidence for incorrect cases ($r_s=.617$). We also found that model size positively correlates with the diagnosis performance ($r_s=.771$). Finally, we discuss the implications of these findings, the overconfidence issue, and directions for improving automated cognitive skill diagnosis.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
Advancements in Multimodal Differential Evolution: A Comprehensive Review and Future Perspectives
Authors:
Dikshit Chauhan,
Shivani,
Donghwi Jung,
Anupam Yadav
Abstract:
Multi-modal optimization involves identifying multiple global and local optima of a function, offering valuable insights into diverse optimal solutions within the search space. Evolutionary algorithms (EAs) excel at finding multiple solutions in a single run, providing a distinct advantage over classical optimization techniques that often require multiple restarts without guarantee of obtaining di…
▽ More
Multi-modal optimization involves identifying multiple global and local optima of a function, offering valuable insights into diverse optimal solutions within the search space. Evolutionary algorithms (EAs) excel at finding multiple solutions in a single run, providing a distinct advantage over classical optimization techniques that often require multiple restarts without guarantee of obtaining diverse solutions. Among these EAs, differential evolution (DE) stands out as a powerful and versatile optimizer for continuous parameter spaces. DE has shown significant success in multi-modal optimization by utilizing its population-based search to promote the formation of multiple stable subpopulations, each targeting different optima. Recent advancements in DE for multi-modal optimization have focused on niching methods, parameter adaptation, hybridization with other algorithms including machine learning, and applications across various domains. Given these developments, it is an opportune moment to present a critical review of the latest literature and identify key future research directions. This paper offers a comprehensive overview of recent DE advancements in multimodal optimization, including methods for handling multiple optima, hybridization with EAs, and machine learning, and highlights a range of real-world applications. Additionally, the paper outlines a set of compelling open problems and future research issues from multiple perspectives
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models
Authors:
Dahyun Jung,
Seungyoon Lee,
Hyeonseok Moon,
Chanjun Park,
Heuiseok Lim
Abstract:
Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate…
▽ More
Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions. To address this critical gap, we introduce a new benchmark, Fairness Benchmark in LLM under Extreme Scenarios (FLEX), designed to test whether LLMs can sustain fairness even when exposed to prompts constructed to induce bias. To thoroughly evaluate the robustness of LLMs, we integrate prompts that amplify potential biases into the fairness assessment. Comparative experiments between FLEX and existing benchmarks demonstrate that traditional evaluations may underestimate the inherent risks in models. This highlights the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching
Authors:
Dongki Jung,
Jaehoon Choi,
Yonghan Lee,
Somi Jeong,
Taejae Lee,
Dinesh Manocha,
Suyong Yeon
Abstract:
We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), specifically designed for omnidirectional images. Equirectangular projection (ERP) images, with their large fields of view, are particularly suited for dense matching techniques that aim to establish comprehensive correspondences across images. However…
▽ More
We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), specifically designed for omnidirectional images. Equirectangular projection (ERP) images, with their large fields of view, are particularly suited for dense matching techniques that aim to establish comprehensive correspondences across images. However, ERP images are subject to significant distortions, which we address by leveraging the spherical camera model and geodesic flow refinement in the dense matching method. To further mitigate these distortions, we propose spherical positional embeddings based on 3D Cartesian coordinates of the feature grid. Additionally, our method incorporates bidirectional transformations between spherical and Cartesian coordinate systems during refinement, utilizing a unit sphere to improve matching performance. We demonstrate that our proposed method achieves notable performance enhancements, with improvements of +26.72 and +42.62 in AUC@5° on the Matterport3D and Stanford2D3D datasets.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
Kanana: Compute-efficient Bilingual Language Models
Authors:
Kanana LLM Team,
Yunju Bak,
Hojin Lee,
Minho Ryu,
Jiyeon Ham,
Seungjae Jung,
Daniel Wontae Nam,
Taegyeong Eo,
Donghun Lee,
Doohae Jung,
Boseop Kim,
Nayeon Kim,
Jaesun Park,
Hyunho Kim,
Hyunwoong Ko,
Changmin Lee,
Kyoung-Woon On,
Seulye Baeg,
Junrae Cho,
Sunghee Jung,
Jieun Kang,
EungGyun Kim,
Eunhwa Kim,
Byeongil Ko,
Daniel Lee
, et al. (4 additional authors not shown)
Abstract:
We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality dat…
▽ More
We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.
△ Less
Submitted 28 February, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
CoME: An Unlearning-based Approach to Conflict-free Model Editing
Authors:
Dahyun Jung,
Jaehyung Seo,
Jaewook Lee,
Chanjun Park,
Heuiseok Lim
Abstract:
Large language models (LLMs) often retain outdated or incorrect information from pre-training, which undermines their reliability. While model editing methods have been developed to address such errors without full re-training, they frequently suffer from knowledge conflicts, where outdated information interferes with new knowledge. In this work, we propose Conflict-free Model Editing (CoME), a no…
▽ More
Large language models (LLMs) often retain outdated or incorrect information from pre-training, which undermines their reliability. While model editing methods have been developed to address such errors without full re-training, they frequently suffer from knowledge conflicts, where outdated information interferes with new knowledge. In this work, we propose Conflict-free Model Editing (CoME), a novel framework that enhances the accuracy of knowledge updates in LLMs by selectively removing outdated knowledge. CoME leverages unlearning to mitigate knowledge interference, allowing new information to be integrated without compromising relevant linguistic features. Through experiments on GPT-J and LLaMA-3 using Counterfact and ZsRE datasets, we demonstrate that CoME improves both editing accuracy and model reliability when applied to existing editing methods. Our results highlight that the targeted removal of outdated knowledge is crucial for enhancing model editing effectiveness and maintaining the model's generative performance.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
IM360: Textured Mesh Reconstruction for Large-scale Indoor Mapping with 360$^\circ$ Cameras
Authors:
Dongki Jung,
Jaehoon Choi,
Yonghan Lee,
Dinesh Manocha
Abstract:
We present a novel 3D reconstruction pipeline for 360$^\circ$ cameras for 3D mapping and rendering of indoor environments. Traditional Structure-from-Motion (SfM) methods may not work well in large-scale indoor scenes due to the prevalence of textureless and repetitive regions. To overcome these challenges, our approach (IM360) leverages the wide field of view of omnidirectional images and integra…
▽ More
We present a novel 3D reconstruction pipeline for 360$^\circ$ cameras for 3D mapping and rendering of indoor environments. Traditional Structure-from-Motion (SfM) methods may not work well in large-scale indoor scenes due to the prevalence of textureless and repetitive regions. To overcome these challenges, our approach (IM360) leverages the wide field of view of omnidirectional images and integrates the spherical camera model into every core component of the SfM pipeline. In order to develop a comprehensive 3D reconstruction solution, we integrate a neural implicit surface reconstruction technique to generate high-quality surfaces from sparse input data. Additionally, we utilize a mesh-based neural rendering approach to refine texture maps and accurately capture view-dependent properties by combining diffuse and specular components. We evaluate our pipeline on large-scale indoor scenes from the Matterport3D and Stanford2D3D datasets. In practice, IM360 demonstrate superior performance in terms of textured mesh reconstruction over SOTA. We observe accuracy improvements in terms of camera localization and registration as well as rendering high frequency details.
△ Less
Submitted 19 February, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
System Message Generation for User Preferences using Open-Source Models
Authors:
Minbyul Jeong,
Jungho Cho,
Minsoo Khang,
Dawoon Jung,
Teakgyu Hong
Abstract:
System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, and specify various output formats and communication styles. Despite such versatility, publicly available datasets often lack system messages a…
▽ More
System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, and specify various output formats and communication styles. Despite such versatility, publicly available datasets often lack system messages and are subject to strict license constraints in industrial applications. Moreover, manually annotating system messages that align with user instructions is resource-intensive. In light of these challenges, we introduce SysGen, a pipeline for generating system messages that better align assistant responses with user instructions using existing supervised fine-tuning datasets that lack system messages. Training open-source models on SysGen data yields substantial improvements in both single-turn (Multifacet) and multi-turn (SysBench) conversation benchmarks. Notably, our method shows strong gains in shorter conversations, suggesting that it enhances early-stage interaction effectiveness. Our qualitative analysis further emphasizes the value of diverse and structured system messages in improving LLM adaptability across varied user scenarios.
△ Less
Submitted 22 May, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
-
Securing DRAM at Scale: ARFM-Driven Row Hammer Defense with Unveiling the Threat of Short tRC Patterns
Authors:
Nogeun Joo,
Donghyuk Kim,
Hyunjun Cho,
Junseok Noh,
Dongha Jung,
Joo-Young Kim
Abstract:
To address the issue of powerful row hammer (RH) attacks, our study involved an extensive analysis of the prevalent attack patterns in the field. We discovered a strong correlation between the timing and density of the active-to-active command period, ${tRC}$, and the likelihood of RH attacks. In this paper, we introduce MARC, an innovative ARFM-driven RH mitigation IP that significantly reinforce…
▽ More
To address the issue of powerful row hammer (RH) attacks, our study involved an extensive analysis of the prevalent attack patterns in the field. We discovered a strong correlation between the timing and density of the active-to-active command period, ${tRC}$, and the likelihood of RH attacks. In this paper, we introduce MARC, an innovative ARFM-driven RH mitigation IP that significantly reinforces existing RH mitigation IPs. MARC dynamically adjusts the frequency of RFM in response to the severity of the RH attack environment, offering a tailored security solution that not only detects the threats but also adapts to varying threat levels. MARC's detection mechanism has demonstrated remarkable efficiency, identifying over 99\% of attack patterns. Moreover, MARC is designed as a compact hardware module, facilitating tight integration either on the memory controller-side or DRAM-side within the memory system. It only occupies a negligible hardware area of 3363~\textit{$μm^2$}. By activating ARFM based on MARC's detection, the additional energy overhead is also negligible in normal workloads. We conduct experiments to compare the highest row count throughout the patterns, defined as max exposure, between the vanilla RH mitigation IPs and the MARC-enhanced versions of the same IPs, focusing on both DRAM-side and memory controller-side. On the DRAM-side, MARC + probabilistic scheme and MARC + counter-based tracking scheme achieve 8.1$\times$ and 1.5$\times$ improvement in max exposure ratio compared to the vanilla IPs, respectively. On the memory controller-side, the MARC + PARA and MARC + Graphene achieve 50$\times$ and 5.7$\times$ improvement in max exposure ratio compared to the vanilla IPs, respectively. MARC ensures optimal security without sacrificing system performance, making MARC a pioneering solution in the realm of RH attack mitigation.
△ Less
Submitted 24 January, 2025;
originally announced January 2025.
-
MEDFORM: A Foundation Model for Contrastive Learning of CT Imaging and Clinical Numeric Data in Multi-Cancer Analysis
Authors:
Daeun Jung,
Jaehyeok Jang,
Sooyoung Jang,
Yu Rang Park
Abstract:
Computed tomography (CT) and clinical numeric data are essential modalities for cancer evaluation, but building large-scale multimodal training datasets for developing medical foundation models remains challenging due to the structural complexity of multi-slice CT data and high cost of expert annotation. In this study, we propose MEDFORM, a multimodal pre-training strategy that guides CT image rep…
▽ More
Computed tomography (CT) and clinical numeric data are essential modalities for cancer evaluation, but building large-scale multimodal training datasets for developing medical foundation models remains challenging due to the structural complexity of multi-slice CT data and high cost of expert annotation. In this study, we propose MEDFORM, a multimodal pre-training strategy that guides CT image representation learning using complementary information from clinical data for medical foundation model development. MEDFORM efficiently processes CT slice through multiple instance learning (MIL) and adopts a dual pre-training strategy: first pretraining the CT slice feature extractor using SimCLR-based self-supervised learning, then aligning CT and clinical modalities through cross-modal contrastive learning. Our model was pre-trained on three different cancer types: lung cancer (141,171 slices), breast cancer (8,100 slices), colorectal cancer (10,393 slices). The experimental results demonstrated that this dual pre-training strategy improves cancer classification performance and maintains robust performance in few-shot learning scenarios. Code available at https://github.com/DigitalHealthcareLab/25MultiModalFoundationModel.git
△ Less
Submitted 22 January, 2025;
originally announced January 2025.
-
Know "No'' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Authors:
Junsung Park,
Jungbeom Lee,
Jongyoon Song,
Sangwon Yu,
Dahuin Jung,
Sungroh Yoon
Abstract:
While CLIP has significantly advanced multimodal understanding by bridging vision and language, the inability to grasp negation - such as failing to differentiate concepts like "parking" from "no parking" - poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we intr…
▽ More
While CLIP has significantly advanced multimodal understanding by bridging vision and language, the inability to grasp negation - such as failing to differentiate concepts like "parking" from "no parking" - poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg-a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.
△ Less
Submitted 31 March, 2025; v1 submitted 18 January, 2025;
originally announced January 2025.
-
GOTPR: General Outdoor Text-based Place Recognition Using Scene Graph Retrieval with OpenStreetMap
Authors:
Donghwi Jung,
Keonwoo Kim,
Seong-Woo Kim
Abstract:
We propose GOTPR, a robust place recognition method designed for outdoor environments where GPS signals are unavailable. Unlike existing approaches that use point cloud maps, which are large and difficult to store, GOTPR leverages scene graphs generated from text descriptions and maps for place recognition. This method improves scalability by replacing point clouds with compact data structures, al…
▽ More
We propose GOTPR, a robust place recognition method designed for outdoor environments where GPS signals are unavailable. Unlike existing approaches that use point cloud maps, which are large and difficult to store, GOTPR leverages scene graphs generated from text descriptions and maps for place recognition. This method improves scalability by replacing point clouds with compact data structures, allowing robots to efficiently store and utilize extensive map data. In addition, GOTPR eliminates the need for custom map creation by using publicly available OpenStreetMap data, which provides global spatial information. We evaluated its performance using the KITTI360Pose dataset with corresponding OpenStreetMap data, comparing it to existing point cloud-based place recognition methods. The results show that GOTPR achieves comparable accuracy while significantly reducing storage requirements. In city-scale tests, it completed processing within a few seconds, making it highly practical for real-world robotics applications. More information can be found at https://donghwijung.github.io/GOTPR_page/.
△ Less
Submitted 22 May, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
-
AuxDepthNet: Real-Time Monocular 3D Object Detection with Depth-Sensitive Features
Authors:
Ruochen Zhang,
Hyeung-Sik Choi,
Dongwook Jung,
Phan Huy Nam Anh,
Sang-Ki Jeong,
Zihao Zhu
Abstract:
Monocular 3D object detection is a challenging task in autonomous systems due to the lack of explicit depth information in single-view images. Existing methods often depend on external depth estimators or expensive sensors, which increase computational complexity and hinder real-time performance. To overcome these limitations, we propose AuxDepthNet, an efficient framework for real-time monocular…
▽ More
Monocular 3D object detection is a challenging task in autonomous systems due to the lack of explicit depth information in single-view images. Existing methods often depend on external depth estimators or expensive sensors, which increase computational complexity and hinder real-time performance. To overcome these limitations, we propose AuxDepthNet, an efficient framework for real-time monocular 3D object detection that eliminates the reliance on external depth maps or pre-trained depth models. AuxDepthNet introduces two key components: the Auxiliary Depth Feature (ADF) module, which implicitly learns depth-sensitive features to improve spatial reasoning and computational efficiency, and the Depth Position Mapping (DPM) module, which embeds depth positional information directly into the detection process to enable accurate object localization and 3D bounding box regression. Leveraging the DepthFusion Transformer architecture, AuxDepthNet globally integrates visual and depth-sensitive features through depth-guided interactions, ensuring robust and efficient detection. Extensive experiments on the KITTI dataset show that AuxDepthNet achieves state-of-the-art performance, with $\text{AP}_{3D}$ scores of 24.72\% (Easy), 18.63\% (Moderate), and 15.31\% (Hard), and $\text{AP}_{\text{BEV}}$ scores of 34.11\% (Easy), 25.18\% (Moderate), and 21.90\% (Hard) at an IoU threshold of 0.7.
△ Less
Submitted 7 January, 2025;
originally announced January 2025.
-
Modeling and Analysis of Hybrid GEO-LEO Satellite Networks
Authors:
Dong-Hyun Jung,
Hongjae Nam,
Junil Choi,
David J. Love
Abstract:
As the number of low Earth orbit (LEO) satellites rapidly increases, the consideration of frequency sharing or cooperation between geosynchronous Earth orbit (GEO) and LEO satellites is gaining attention. In this paper, we consider a hybrid GEO-LEO satellite network where GEO and LEO satellites are distributed according to independent Poisson point processes (PPPs) and share the same frequency res…
▽ More
As the number of low Earth orbit (LEO) satellites rapidly increases, the consideration of frequency sharing or cooperation between geosynchronous Earth orbit (GEO) and LEO satellites is gaining attention. In this paper, we consider a hybrid GEO-LEO satellite network where GEO and LEO satellites are distributed according to independent Poisson point processes (PPPs) and share the same frequency resources. Based on the properties of PPPs, we first analyze satellite-visible probabilities, distance distributions, and association probabilities. Then, we derive an analytical expression for the network's coverage probability. Through Monte Carlo simulations, we verify the analytical results and demonstrate the impact of system parameters on coverage performance. The analytical results effectively estimate the coverage performance in scenarios where GEO and LEO satellites cooperate or share the same resource.
△ Less
Submitted 18 October, 2024;
originally announced October 2024.
-
Mode-GS: Monocular Depth Guided Anchored 3D Gaussian Splatting for Robust Ground-View Scene Rendering
Authors:
Yonghan Lee,
Jaehoon Choi,
Dongki Jung,
Jaeseong Yun,
Soohyun Ryu,
Dinesh Manocha,
Suyong Yeon
Abstract:
We present a novel-view rendering algorithm, Mode-GS, for ground-robot trajectory datasets. Our approach is based on using anchored Gaussian splats, which are designed to overcome the limitations of existing 3D Gaussian splatting algorithms. Prior neural rendering methods suffer from severe splat drift due to scene complexity and insufficient multi-view observation, and can fail to fix splats on t…
▽ More
We present a novel-view rendering algorithm, Mode-GS, for ground-robot trajectory datasets. Our approach is based on using anchored Gaussian splats, which are designed to overcome the limitations of existing 3D Gaussian splatting algorithms. Prior neural rendering methods suffer from severe splat drift due to scene complexity and insufficient multi-view observation, and can fail to fix splats on the true geometry in ground-robot datasets. Our method integrates pixel-aligned anchors from monocular depths and generates Gaussian splats around these anchors using residual-form Gaussian decoders. To address the inherent scale ambiguity of monocular depth, we parameterize anchors with per-view depth-scales and employ scale-consistent depth loss for online scale calibration. Our method results in improved rendering performance, based on PSNR, SSIM, and LPIPS metrics, in ground scenes with free trajectory patterns, and achieves state-of-the-art rendering performance on the R3LIVE odometry dataset and the Tanks and Temples dataset.
△ Less
Submitted 6 October, 2024;
originally announced October 2024.
-
Efficient Training of Neural Stochastic Differential Equations by Matching Finite Dimensional Distributions
Authors:
Jianxin Zhang,
Josh Viktorov,
Doosan Jung,
Emily Pitler
Abstract:
Neural Stochastic Differential Equations (Neural SDEs) have emerged as powerful mesh-free generative models for continuous stochastic processes, with critical applications in fields such as finance, physics, and biology. Previous state-of-the-art methods have relied on adversarial training, such as GANs, or on minimizing distance measures between processes using signature kernels. However, GANs su…
▽ More
Neural Stochastic Differential Equations (Neural SDEs) have emerged as powerful mesh-free generative models for continuous stochastic processes, with critical applications in fields such as finance, physics, and biology. Previous state-of-the-art methods have relied on adversarial training, such as GANs, or on minimizing distance measures between processes using signature kernels. However, GANs suffer from issues like instability, mode collapse, and the need for specialized training techniques, while signature kernel-based methods require solving linear PDEs and backpropagating gradients through the solver, whose computational complexity scales quadratically with the discretization steps. In this paper, we identify a novel class of strictly proper scoring rules for comparing continuous Markov processes. This theoretical finding naturally leads to a novel approach called Finite Dimensional Matching (FDM) for training Neural SDEs. Our method leverages the Markov property of SDEs to provide a computationally efficient training objective. This scoring rule allows us to bypass the computational overhead associated with signature kernels and reduces the training complexity from $O(D^2)$ to $O(D)$ per epoch, where $D$ represents the number of discretization steps of the process. We demonstrate that FDM achieves superior performance, consistently outperforming existing methods in terms of both computational efficiency and generative quality.
△ Less
Submitted 26 March, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Textual Training for the Hassle-Free Removal of Unwanted Visual Data: Case Studies on OOD and Hateful Image Detection
Authors:
Saehyung Lee,
Jisoo Mok,
Sangha Park,
Yongho Shin,
Dahuin Jung,
Sungroh Yoon
Abstract:
In our study, we explore methods for detecting unwanted content lurking in visual datasets. We provide a theoretical analysis demonstrating that a model capable of successfully partitioning visual data can be obtained using only textual data. Based on the analysis, we propose Hassle-Free Textual Training (HFTT), a streamlined method capable of acquiring detectors for unwanted visual content, using…
▽ More
In our study, we explore methods for detecting unwanted content lurking in visual datasets. We provide a theoretical analysis demonstrating that a model capable of successfully partitioning visual data can be obtained using only textual data. Based on the analysis, we propose Hassle-Free Textual Training (HFTT), a streamlined method capable of acquiring detectors for unwanted visual content, using only synthetic textual data in conjunction with pre-trained vision-language models. HFTT features an innovative objective function that significantly reduces the necessity for human involvement in data annotation. Furthermore, HFTT employs a clever textual data synthesis method, effectively emulating the integration of unknown visual data distribution into the training process at no extra cost. The unique characteristics of HFTT extend its utility beyond traditional out-of-distribution detection, making it applicable to tasks that address more abstract concepts. We complement our analyses with experiments in out-of-distribution detection and hateful image detection. Our codes are available at https://github.com/Saehyung-Lee/HFTT
△ Less
Submitted 23 October, 2024; v1 submitted 29 September, 2024;
originally announced September 2024.
-
Evaluating the Impact of a Specialized LLM on Physician Experience in Clinical Decision Support: A Comparison of Ask Avo and ChatGPT-4
Authors:
Daniel Jung,
Alex Butler,
Joongheum Park,
Yair Saperstein
Abstract:
The use of Large language models (LLMs) to augment clinical decision support systems is a topic with rapidly growing interest, but current shortcomings such as hallucinations and lack of clear source citations make them unreliable for use in the clinical environment. This study evaluates Ask Avo, an LLM-derived software by AvoMD that incorporates a proprietary Language Model Augmented Retrieval (L…
▽ More
The use of Large language models (LLMs) to augment clinical decision support systems is a topic with rapidly growing interest, but current shortcomings such as hallucinations and lack of clear source citations make them unreliable for use in the clinical environment. This study evaluates Ask Avo, an LLM-derived software by AvoMD that incorporates a proprietary Language Model Augmented Retrieval (LMAR) system, in-built visual citation cues, and prompt engineering designed for interactions with physicians, against ChatGPT-4 in end-user experience for physicians in a simulated clinical scenario environment. Eight clinical questions derived from medical guideline documents in various specialties were prompted to both models by 62 study participants, with each response rated on trustworthiness, actionability, relevancy, comprehensiveness, and friendly format from 1 to 5. Ask Avo significantly outperformed ChatGPT-4 in all criteria: trustworthiness (4.52 vs. 3.34, p<0.001), actionability (4.41 vs. 3.19, p<0.001), relevancy (4.55 vs. 3.49, p<0.001), comprehensiveness (4.50 vs. 3.37, p<0.001), and friendly format (4.52 vs. 3.60, p<0.001). Our findings suggest that specialized LLMs designed with the needs of clinicians in mind can offer substantial improvements in user experience over general-purpose LLMs. Ask Avo's evidence-based approach tailored to clinician needs shows promise in the adoption of LLM-augmented clinical decision support software.
△ Less
Submitted 6 September, 2024;
originally announced September 2024.
-
Point Cloud Structural Similarity-based Underwater Sonar Loop Detection
Authors:
Donghwi Jung,
Andres Pulido,
Jane Shin,
Seong-Woo Kim
Abstract:
In this letter, we propose a point cloud structural similarity-based loop detection method for underwater Simultaneous Localization and Mapping using sonar sensors. Existing sonar-based loop detection approaches often rely on 2D projection and keypoint extraction, which can lead to data loss and poor performance in feature-scarce environments. Additionally, methods based on neural networks or Bag-…
▽ More
In this letter, we propose a point cloud structural similarity-based loop detection method for underwater Simultaneous Localization and Mapping using sonar sensors. Existing sonar-based loop detection approaches often rely on 2D projection and keypoint extraction, which can lead to data loss and poor performance in feature-scarce environments. Additionally, methods based on neural networks or Bag-of-Words require extensive preprocessing, such as model training or vocabulary creation, reducing adaptability to new environments. To address these challenges, our method directly utilizes 3D sonar point clouds without projection and computes point-wise structural feature maps based on geometry, normals, and curvature. By leveraging rotation-invariant similarity comparisons, the proposed approach eliminates the need for keypoint detection and ensures robust loop detection across diverse underwater terrains. We validate our method using two real-world datasets: the Antarctica dataset obtained from deep underwater and the Seaward dataset collected from rivers and lakes. Experimental results show that our method achieves the highest loop detection performance compared to existing keypointbased and learning-based approaches while requiring no additional training or preprocessing. Our code is available at https://github.com/donghwijung/point_cloud_structural_similarity_based_underwater_sonar_loop_detection.
△ Less
Submitted 18 March, 2025; v1 submitted 21 September, 2024;
originally announced September 2024.
-
Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation
Authors:
Dongwon Jung,
Qin Liu,
Tenghao Huang,
Ben Zhou,
Muhao Chen
Abstract:
Retrieval-augmented generation (RAG) improves large language models (LMs) by incorporating non-parametric knowledge through evidence retrieved from external sources. However, it often struggles to cope with inconsistent and irrelevant information that can distract the LM from its tasks, especially when multiple evidence pieces are required. While compressing the retrieved evidence with a compressi…
▽ More
Retrieval-augmented generation (RAG) improves large language models (LMs) by incorporating non-parametric knowledge through evidence retrieved from external sources. However, it often struggles to cope with inconsistent and irrelevant information that can distract the LM from its tasks, especially when multiple evidence pieces are required. While compressing the retrieved evidence with a compression model aims to address this issue, the compressed evidence may still be unfamiliar to the target model used for downstream tasks, potentially failing to utilize the evidence effectively. We propose FaviComp (Familarity-Aware Evidence Compression), a novel training-free evidence compression technique that makes retrieved evidence more familiar to the target model, while seamlessly integrating parametric knowledge from the model. Experimental results show that FaviComp consistently outperforms most recent evidence compression baselines across multiple open-domain QA datasets, improving accuracy by up to 28.1% while achieving high compression rates. Additionally, we demonstrate the effective integration of both parametric and non-parametric knowledge during evidence compression.
△ Less
Submitted 16 December, 2024; v1 submitted 19 September, 2024;
originally announced September 2024.
-
E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models
Authors:
Chan Kim,
Keonwoo Kim,
Mintaek Oh,
Hanbi Baek,
Jiyang Lee,
Donghwi Jung,
Soojin Woo,
Younkyung Woo,
John Tucker,
Roya Firoozi,
Seung-Woo Seo,
Mac Schwager,
Seong-Woo Kim
Abstract:
Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stocha…
▽ More
Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stochastic, initial plans based solely on LLMs' general knowledge may fail to achieve their objectives, unlike in static scenarios. To address this limitation, this study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent's real-world experiences, drawing inspiration from human emotional responses. The proposed methodology enables one-shot behavior adjustments by updating the E2Map based on the agent's experiences. Our evaluation in stochastic navigation environments, including both simulations and real-world scenarios, demonstrates that the proposed method significantly enhances performance in stochastic environments compared to existing LLM-based approaches. Code and supplementary materials are available at https://e2map.github.io/.
△ Less
Submitted 2 February, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
UV-Plane Beam Mapping for Non-Terrestrial Networks in 3GPP System-Level Simulations
Authors:
Dong-Hyun Jung,
Sucheol Kim,
Miyeon Lee,
Joon-Gyu Ryu,
Junil Choi
Abstract:
Due to the high altitudes and large beam sizes of satellites, the curvature of the Earth's surface can impact system-level performance. To consider this, 3GPP introduces the UV-plane beam mapping for system-level simulations of non-terrestrial networks (NTNs). This paper aims to provide a comprehensive understanding of how beams and user equipments (UEs) are placed on the UV-plane and subsequently…
▽ More
Due to the high altitudes and large beam sizes of satellites, the curvature of the Earth's surface can impact system-level performance. To consider this, 3GPP introduces the UV-plane beam mapping for system-level simulations of non-terrestrial networks (NTNs). This paper aims to provide a comprehensive understanding of how beams and user equipments (UEs) are placed on the UV-plane and subsequently mapped to the Earth's surface. We present a general process of projecting UEs on the UV-plane onto the Earth's surface. This process could offer a useful guideline for beam and UE deployment when evaluating the system-level performance of NTNs.
△ Less
Submitted 15 August, 2024;
originally announced August 2024.
-
Rate-Splitting for Joint Unicast and Multicast Transmission in LEO Satellite Networks with Non-Uniform Traffic Demand
Authors:
Jaehyup Seong,
Juha Park,
Dong-Hyun Jung,
Jeonghun Park,
Wonjae Shin
Abstract:
Low Earth orbit (LEO) satellite communications (SATCOM) with ubiquitous global connectivity is deemed a pivotal catalyst in advancing wireless communication systems for 5G and beyond. LEO SATCOM excels in delivering versatile information services across expansive areas, facilitating both unicast and multicast transmissions via high-speed broadband capability. Nonetheless, given the broadband cover…
▽ More
Low Earth orbit (LEO) satellite communications (SATCOM) with ubiquitous global connectivity is deemed a pivotal catalyst in advancing wireless communication systems for 5G and beyond. LEO SATCOM excels in delivering versatile information services across expansive areas, facilitating both unicast and multicast transmissions via high-speed broadband capability. Nonetheless, given the broadband coverage of LEO SATCOM, traffic demand distribution within the service area is non-uniform, and the time/frequency/power resources available at LEO satellites remain significantly limited. Motivated by these challenges, we propose a rate-matching framework for non-orthogonal unicast and multicast (NOUM) transmission. Our approach aims to minimize the difference between offered rates and traffic demands for both unicast and multicast messages. By multiplexing unicast and multicast transmissions over the same radio resource, rate-splitting multiple access (RSMA) is employed to manage interference between unicast and multicast streams, as well as inter-user interference under imperfect channel state information at the LEO satellite. To address the formulated problems non-smoothness and non-convexity, the common rate is approximated using the LogSumExp technique. Thereafter, we represent the common rate portion as the ratio of the approximated function, converting the problem into an unconstrained form. A generalized power iteration (GPI)-based algorithm, coined GPI-RS-NOUM, is proposed upon this reformulation. Through comprehensive numerical analysis across diverse simulation setups, we demonstrate that the proposed framework outperforms various benchmarks for LEO SATCOM with uneven traffic demands.
△ Less
Submitted 5 August, 2024;
originally announced August 2024.
-
Normality Addition via Normality Detection in Industrial Image Anomaly Detection Models
Authors:
Jihun Yi,
Dahuin Jung,
Sungroh Yoon
Abstract:
The task of image anomaly detection (IAD) aims to identify deviations from normality in image data. These anomalies are patterns that deviate significantly from what the IAD model has learned from the data during training. However, in real-world scenarios, the criteria for what constitutes normality often change, necessitating the reclassification of previously anomalous instances as normal. To ad…
▽ More
The task of image anomaly detection (IAD) aims to identify deviations from normality in image data. These anomalies are patterns that deviate significantly from what the IAD model has learned from the data during training. However, in real-world scenarios, the criteria for what constitutes normality often change, necessitating the reclassification of previously anomalous instances as normal. To address this challenge, we propose a new scenario termed "normality addition," involving the post-training adjustment of decision boundaries to incorporate new normalities. To address this challenge, we propose a method called Normality Addition via Normality Detection (NAND), leveraging a vision-language model. NAND performs normality detection which detect patterns related to the intended normality within images based on textual descriptions. We then modify the results of a pre-trained IAD model to implement this normality addition. Using the benchmark dataset in IAD, MVTec AD, we establish an evaluation protocol for the normality addition task and empirically demonstrate the effectiveness of the NAND method.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
Cactus: Towards Psychological Counseling Conversations using Cognitive Behavioral Theory
Authors:
Suyeon Lee,
Sunghwan Kim,
Minju Kim,
Dongjin Kang,
Dongil Yang,
Harim Kim,
Minseok Kang,
Dayi Jung,
Min Hee Kim,
Seungbeen Lee,
Kyoung-Mee Chung,
Youngjae Yu,
Dongha Lee,
Jinyoung Yeo
Abstract:
Recently, the demand for psychological counseling has significantly increased as more individuals express concerns about their mental health. This surge has accelerated efforts to improve the accessibility of counseling by using large language models (LLMs) as counselors. To ensure client privacy, training open-source LLMs faces a key challenge: the absence of realistic counseling datasets. To add…
▽ More
Recently, the demand for psychological counseling has significantly increased as more individuals express concerns about their mental health. This surge has accelerated efforts to improve the accessibility of counseling by using large language models (LLMs) as counselors. To ensure client privacy, training open-source LLMs faces a key challenge: the absence of realistic counseling datasets. To address this, we introduce Cactus, a multi-turn dialogue dataset that emulates real-life interactions using the goal-oriented and structured approach of Cognitive Behavioral Therapy (CBT). We create a diverse and realistic dataset by designing clients with varied, specific personas, and having counselors systematically apply CBT techniques in their interactions. To assess the quality of our data, we benchmark against established psychological criteria used to evaluate real counseling sessions, ensuring alignment with expert evaluations. Experimental results demonstrate that Camel, a model trained with Cactus, outperforms other models in counseling skills, highlighting its effectiveness and potential as a counseling agent. We make our data, model, and code publicly available.
△ Less
Submitted 6 October, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
No More Potentially Dynamic Objects: Static Point Cloud Map Generation based on 3D Object Detection and Ground Projection
Authors:
Soojin Woo,
Donghwi Jung,
Seong-Woo Kim
Abstract:
In this paper, we propose an algorithm to generate a static point cloud map based on LiDAR point cloud data. Our proposed pipeline detects dynamic objects using 3D object detectors and projects points of dynamic objects onto the ground. Typically, point cloud data acquired in real-time serves as a snapshot of the surrounding areas containing both static objects and dynamic objects. The static obje…
▽ More
In this paper, we propose an algorithm to generate a static point cloud map based on LiDAR point cloud data. Our proposed pipeline detects dynamic objects using 3D object detectors and projects points of dynamic objects onto the ground. Typically, point cloud data acquired in real-time serves as a snapshot of the surrounding areas containing both static objects and dynamic objects. The static objects include buildings and trees, otherwise, the dynamic objects contain objects such as parked cars that change their position over time. Removing dynamic objects from the point cloud map is crucial as they can degrade the quality and localization accuracy of the map. To address this issue, in this paper, we propose an algorithm that creates a map only consisting of static objects. We apply a 3D object detection algorithm to the point cloud data which are obtained from LiDAR to implement our pipeline. We then stack the points to create the map after performing ground segmentation and projection. As a result, not only we can eliminate currently dynamic objects at the time of map generation but also potentially dynamic objects such as parked vehicles. We validate the performance of our method using two kinds of datasets collected on real roads: KITTI and our dataset. The result demonstrates the capability of our proposal to create an accurate static map excluding dynamic objects from input point clouds. Also, we verified the improved performance of localization using a generated map based on our method.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
3D Operation of Autonomous Excavator based on Reinforcement Learning through Independent Reward for Individual Joints
Authors:
Yoonkyu Yoo,
Donghwi Jung,
Seong-Woo Kim
Abstract:
In this paper, we propose a control algorithm based on reinforcement learning, employing independent rewards for each joint to control excavators in a 3D space. The aim of this research is to address the challenges associated with achieving precise control of excavators, which are extensively utilized in construction sites but prove challenging to control with precision due to their hydraulic stru…
▽ More
In this paper, we propose a control algorithm based on reinforcement learning, employing independent rewards for each joint to control excavators in a 3D space. The aim of this research is to address the challenges associated with achieving precise control of excavators, which are extensively utilized in construction sites but prove challenging to control with precision due to their hydraulic structures. Traditional methods relied on operator expertise for precise excavator operation, occasionally resulting in safety accidents. Therefore, there have been endeavors to attain precise excavator control through equation-based control algorithms. However, these methods had the limitation of necessitating prior information related to physical values of the excavator, rendering them unsuitable for the diverse range of excavators used in the field. To overcome these limitations, we have explored reinforcement learning-based control methods that do not demand prior knowledge of specific equipment but instead utilize data to train models. Nevertheless, existing reinforcement learning-based methods overlooked cabin swing rotation and confined the bucket's workspace to a 2D plane. Control confined within such a limited area diminishes the applicability of the algorithm in construction sites. We address this issue by expanding the previous 2D plane workspace of the bucket operation into a 3D space, incorporating cabin swing rotation. By expanding the workspace into 3D, excavators can execute continuous operations without requiring human intervention. To accomplish this objective, distinct targets were established for each joint, facilitating the training of action values for each joint independently, regardless of the progress of other joint learning.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
Burst Image Super-Resolution with Base Frame Selection
Authors:
Sanghyun Kim,
Min Jung Lee,
Woohyeok Kim,
Deunsol Jung,
Jaesung Rim,
Sunghyun Cho,
Minsu Cho
Abstract:
Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst Image…
▽ More
Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst Image (NEBI), that includes the burst frames at varying exposure times to obtain a broader range of irradiance and motion characteristics within a scene. As burst shots with non-uniform exposures exhibit varying levels of degradation, fusing information of the burst shots into the first frame as a base frame may not result in optimal image quality. To address this limitation, we propose a Frame Selection Network (FSN) for non-uniform scenarios. This network seamlessly integrates into existing super-resolution methods in a plug-and-play manner with low computational costs. The comparative analysis reveals the effectiveness of the nonuniform setting for the practical scenario and our FSN on synthetic-/real- NEBI datasets.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Disentangled Motion Modeling for Video Frame Interpolation
Authors:
Jaihyun Lew,
Jooyoung Choi,
Chaehun Shin,
Dahuin Jung,
Sungroh Yoon
Abstract:
Video Frame Interpolation (VFI) aims to synthesize intermediate frames between existing frames to enhance visual smoothness and quality. Beyond the conventional methods based on the reconstruction loss, recent works have employed generative models for improved perceptual quality. However, they require complex training and large computational costs for pixel space modeling. In this paper, we introd…
▽ More
Video Frame Interpolation (VFI) aims to synthesize intermediate frames between existing frames to enhance visual smoothness and quality. Beyond the conventional methods based on the reconstruction loss, recent works have employed generative models for improved perceptual quality. However, they require complex training and large computational costs for pixel space modeling. In this paper, we introduce disentangled Motion Modeling (MoMo), a diffusion-based approach for VFI that enhances visual quality by focusing on intermediate motion modeling. We propose a disentangled two-stage training process. In the initial stage, frame synthesis and flow models are trained to generate accurate frames and flows optimal for synthesis. In the subsequent stage, we introduce a motion diffusion model, which incorporates our novel U-Net architecture specifically designed for optical flow, to generate bi-directional flows between frames. By learning the simpler low-frequency representation of motions, MoMo achieves superior perceptual quality with reduced computational demands compared to the generative modeling methods on the pixel space. MoMo surpasses state-of-the-art methods in perceptual metrics across various benchmarks, demonstrating its efficacy and efficiency in VFI.
△ Less
Submitted 18 December, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer
Authors:
Hyeongjin Nam,
Daniel Sungho Jung,
Gyeongsik Moon,
Kyoung Mu Lee
Abstract:
Human-object contact serves as a strong cue to understand how humans physically interact with objects. Nevertheless, it is not widely explored to utilize human-object contact information for the joint reconstruction of 3D human and object from a single image. In this work, we present a novel joint 3D human-object reconstruction method (CONTHO) that effectively exploits contact information between…
▽ More
Human-object contact serves as a strong cue to understand how humans physically interact with objects. Nevertheless, it is not widely explored to utilize human-object contact information for the joint reconstruction of 3D human and object from a single image. In this work, we present a novel joint 3D human-object reconstruction method (CONTHO) that effectively exploits contact information between humans and objects. There are two core designs in our system: 1) 3D-guided contact estimation and 2) contact-based 3D human and object refinement. First, for accurate human-object contact estimation, CONTHO initially reconstructs 3D humans and objects and utilizes them as explicit 3D guidance for contact estimation. Second, to refine the initial reconstructions of 3D human and object, we propose a novel contact-based refinement Transformer that effectively aggregates human features and object features based on the estimated human-object contact. The proposed contact-based refinement prevents the learning of erroneous correlation between human and object, which enables accurate 3D reconstruction. As a result, our CONTHO achieves state-of-the-art performance in both human-object contact estimation and joint reconstruction of 3D human and object. The code is publicly available at https://github.com/dqj5182/CONTHO_RELEASE.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
Planning and Editing What You Retrieve for Enhanced Tool Learning
Authors:
Tenghao Huang,
Dongwon Jung,
Muhao Chen
Abstract:
Recent advancements in integrating external tools with Large Language Models (LLMs) have opened new frontiers, with applications in mathematical reasoning, code generators, and smart assistants. However, existing methods, relying on simple one-time retrieval strategies, fall short on effectively and accurately shortlisting relevant tools. This paper introduces a novel PLUTO (Planning, Learning, an…
▽ More
Recent advancements in integrating external tools with Large Language Models (LLMs) have opened new frontiers, with applications in mathematical reasoning, code generators, and smart assistants. However, existing methods, relying on simple one-time retrieval strategies, fall short on effectively and accurately shortlisting relevant tools. This paper introduces a novel PLUTO (Planning, Learning, and Understanding for TOols) approach, encompassing `Plan-and-Retrieve (P&R)` and `Edit-and-Ground (E&G)` paradigms. The P&R paradigm consists of a neural retrieval module for shortlisting relevant tools and an LLM-based query planner that decomposes complex queries into actionable tasks, enhancing the effectiveness of tool utilization. The E&G paradigm utilizes LLMs to enrich tool descriptions based on user scenarios, bridging the gap between user queries and tool functionalities. Experiment results demonstrate that these paradigms significantly improve the recall and NDCG in tool retrieval tasks, significantly surpassing current state-of-the-art models.
△ Less
Submitted 4 April, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
-
Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation
Authors:
Yeongtak Oh,
Jonghyun Lee,
Jooyoung Choi,
Dahuin Jung,
Uiwon Hwang,
Sungroh Yoon
Abstract:
Test-time adaptation (TTA) addresses the unforeseen distribution shifts occurring during test time. In TTA, performance, memory consumption, and time consumption are crucial considerations. A recent diffusion-based TTA approach for restoring corrupted images involves image-level updates. However, using pixel space diffusion significantly increases resource requirements compared to conventional mod…
▽ More
Test-time adaptation (TTA) addresses the unforeseen distribution shifts occurring during test time. In TTA, performance, memory consumption, and time consumption are crucial considerations. A recent diffusion-based TTA approach for restoring corrupted images involves image-level updates. However, using pixel space diffusion significantly increases resource requirements compared to conventional model updating TTA approaches, revealing limitations as a TTA method. To address this, we propose a novel TTA method that leverages an image editing model based on a latent diffusion model (LDM) and fine-tunes it using our newly introduced corruption modeling scheme. This scheme enhances the robustness of the diffusion model against distribution shifts by creating (clean, corrupted) image pairs and fine-tuning the model to edit corrupted images into clean ones. Moreover, we introduce a distilled variant to accelerate the model for corruption editing using only 4 network function evaluations (NFEs). We extensively validated our method across various architectures and datasets including image and video domains. Our model achieves the best performance with a 100 times faster runtime than that of a diffusion-based baseline. Furthermore, it is three times faster than the previous model updating TTA method that utilizes data augmentation, making an image-level updating approach more feasible.
△ Less
Submitted 11 July, 2024; v1 submitted 16 March, 2024;
originally announced March 2024.
-
SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models
Authors:
Jaerin Lee,
Daniel Sungho Jung,
Kanggeon Lee,
Kyoung Mu Lee
Abstract:
We introduce SemanticDraw, a new paradigm of interactive content creation where high-quality images are generated in near real-time from given multiple hand-drawn regions, each encoding prescribed semantic meaning. In order to maximize the productivity of content creators and to fully realize their artistic imagination, it requires both quick interactive interfaces and fine-grained regional contro…
▽ More
We introduce SemanticDraw, a new paradigm of interactive content creation where high-quality images are generated in near real-time from given multiple hand-drawn regions, each encoding prescribed semantic meaning. In order to maximize the productivity of content creators and to fully realize their artistic imagination, it requires both quick interactive interfaces and fine-grained regional controls in their tools. Despite astonishing generation quality from recent diffusion models, we find that existing approaches for regional controllability are very slow (52 seconds for $512 \times 512$ image) while not compatible with acceleration methods such as LCM, blocking their huge potential in interactive content creation. From this observation, we build our solution for interactive content creation in two steps: (1) we establish compatibility between region-based controls and acceleration techniques for diffusion models, maintaining high fidelity of multi-prompt image generation with $\times 10$ reduced number of inference steps, (2) we increase the generation throughput with our new multi-prompt stream batch pipeline, enabling low-latency generation from multiple, region-based text prompts on a single RTX 2080 Ti GPU. Our proposed framework is generalizable to any existing diffusion models and acceleration schedulers, allowing sub-second (0.64 seconds) image content creation application upon well-established image diffusion models. Our project page is: https://jaerinlee.com/research/semantic-draw
△ Less
Submitted 1 June, 2025; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Entropy is not Enough for Test-Time Adaptation: From the Perspective of Disentangled Factors
Authors:
Jonghyun Lee,
Dahuin Jung,
Saehyung Lee,
Junsung Park,
Juhyeon Shin,
Uiwon Hwang,
Sungroh Yoon
Abstract:
Test-time adaptation (TTA) fine-tunes pre-trained deep neural networks for unseen test data. The primary challenge of TTA is limited access to the entire test dataset during online updates, causing error accumulation. To mitigate it, TTA methods have utilized the model output's entropy as a confidence metric that aims to determine which samples have a lower likelihood of causing error. Through exp…
▽ More
Test-time adaptation (TTA) fine-tunes pre-trained deep neural networks for unseen test data. The primary challenge of TTA is limited access to the entire test dataset during online updates, causing error accumulation. To mitigate it, TTA methods have utilized the model output's entropy as a confidence metric that aims to determine which samples have a lower likelihood of causing error. Through experimental studies, however, we observed the unreliability of entropy as a confidence metric for TTA under biased scenarios and theoretically revealed that it stems from the neglect of the influence of latent disentangled factors of data on predictions. Building upon these findings, we introduce a novel TTA method named Destroy Your Object (DeYO), which leverages a newly proposed confidence metric named Pseudo-Label Probability Difference (PLPD). PLPD quantifies the influence of the shape of an object on prediction by measuring the difference between predictions before and after applying an object-destructive transformation. DeYO consists of sample selection and sample weighting, which employ entropy and PLPD concurrently. For robust adaptation, DeYO prioritizes samples that dominantly incorporate shape information when making predictions. Our extensive experiments demonstrate the consistent superiority of DeYO over baseline methods across various scenarios, including biased and wild. Project page is publicly available at https://whitesnowdrop.github.io/DeYO/.
△ Less
Submitted 12 March, 2024;
originally announced March 2024.
-
MuNES: Multifloor Navigation Including Elevators and Stairs
Authors:
Donghwi Jung,
Chan Kim,
Jae-Kyung Cho,
Seong-Woo Kim
Abstract:
We propose a scheme called MuNES for single mapping and trajectory planning including elevators and stairs. Optimized multifloor trajectories are important for optimal interfloor movements of robots. However, given two or more options of moving between floors, it is difficult to select the best trajectory because there are no suitable indoor multifloor maps in the existing methods. To solve this p…
▽ More
We propose a scheme called MuNES for single mapping and trajectory planning including elevators and stairs. Optimized multifloor trajectories are important for optimal interfloor movements of robots. However, given two or more options of moving between floors, it is difficult to select the best trajectory because there are no suitable indoor multifloor maps in the existing methods. To solve this problem, MuNES creates a single multifloor map including elevators and stairs by estimating altitude changes based on pressure data. In addition, the proposed method performs floor-based loop detection for faster and more accurate loop closure. The single multifloor map is then voxelized leaving only the parts needed for trajectory planning. An optimal and realistic multifloor trajectory is generated by exploring the voxels using an A* algorithm based on the proposed cost function, which affects realistic factors. We tested this algorithm using data acquired from around a campus and note that a single accurate multifloor map could be created. Furthermore, optimal and realistic multifloor trajectory could be found by selecting the means of motion between floors between elevators and stairs according to factors such as the starting point, ending point, and elevator waiting time. The code and data used in this work are available at https://github.com/donghwijung/MuNES.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Alternative Speech: Complementary Method to Counter-Narrative for Better Discourse
Authors:
Seungyoon Lee,
Dahyun Jung,
Chanjun Park,
Seolhwa Lee,
Heuiseok Lim
Abstract:
We introduce the concept of "Alternative Speech" as a new way to directly combat hate speech and complement the limitations of counter-narrative. An alternative speech provides practical alternatives to hate speech in real-world scenarios by offering speech-level corrections to speakers while considering the surrounding context and promoting speakers to reform. Further, an alternative speech can c…
▽ More
We introduce the concept of "Alternative Speech" as a new way to directly combat hate speech and complement the limitations of counter-narrative. An alternative speech provides practical alternatives to hate speech in real-world scenarios by offering speech-level corrections to speakers while considering the surrounding context and promoting speakers to reform. Further, an alternative speech can combat hate speech alongside counter-narratives, offering a useful tool to address social issues such as racial discrimination and gender inequality. We propose the new concept and provide detailed guidelines for constructing the necessary dataset. Through discussion, we demonstrate that combining alternative speech and counter-narrative can be a more effective strategy for combating hate speech by complementing specificity and guiding capacity of counter-narrative. This paper presents another perspective for dealing with hate speech, offering viable remedies to complement the constraints of current approaches to mitigating harmful bias.
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
RHOBIN Challenge: Reconstruction of Human Object Interaction
Authors:
Xianghui Xie,
Xi Wang,
Nikos Athanasiou,
Bharat Lal Bhatnagar,
Chun-Hao P. Huang,
Kaichun Mo,
Hao Chen,
Xia Jia,
Zerui Zhang,
Liangxian Cui,
Xiao Lin,
Bingqiao Qian,
Jie Xiao,
Wenfei Yang,
Hyeongjin Nam,
Daniel Sungho Jung,
Kihoon Kim,
Kyoung Mu Lee,
Otmar Hilliges,
Gerard Pons-Moll
Abstract:
Modeling the interaction between humans and objects has been an emerging research direction in recent years. Capturing human-object interaction is however a very challenging task due to heavy occlusion and complex dynamics, which requires understanding not only 3D human pose, and object pose but also the interaction between them. Reconstruction of 3D humans and objects has been two separate resear…
▽ More
Modeling the interaction between humans and objects has been an emerging research direction in recent years. Capturing human-object interaction is however a very challenging task due to heavy occlusion and complex dynamics, which requires understanding not only 3D human pose, and object pose but also the interaction between them. Reconstruction of 3D humans and objects has been two separate research fields in computer vision for a long time. We hence proposed the first RHOBIN challenge: reconstruction of human-object interactions in conjunction with the RHOBIN workshop. It was aimed at bringing the research communities of human and object reconstruction as well as interaction modeling together to discuss techniques and exchange ideas. Our challenge consists of three tracks of 3D reconstruction from monocular RGB images with a focus on dealing with challenging interaction scenarios. Our challenge attracted more than 100 participants with more than 300 submissions, indicating the broad interest in the research communities. This paper describes the settings of our challenge and discusses the winning methods of each track in more detail. We observe that the human reconstruction task is becoming mature even under heavy occlusion settings while object pose estimation and joint reconstruction remain challenging tasks. With the growing interest in interaction modeling, we hope this report can provide useful insights and foster future research in this direction. Our workshop website can be found at \href{https://rhobin-challenge.github.io/}{https://rhobin-challenge.github.io/}.
△ Less
Submitted 7 January, 2024;
originally announced January 2024.
-
Modeling and Analysis of GEO Satellite Networks
Authors:
Dong-Hyun Jung,
Hongjae Nam,
Junil Choi,
David J. Love
Abstract:
The extensive coverage offered by satellites makes them effective in enhancing service continuity for users on dynamic airborne and maritime platforms, such as airplanes and ships. In particular, geosynchronous Earth orbit (GEO) satellites ensure stable connectivity for terrestrial users due to their stationary characteristics when observed from Earth. This paper introduces a novel approach to mod…
▽ More
The extensive coverage offered by satellites makes them effective in enhancing service continuity for users on dynamic airborne and maritime platforms, such as airplanes and ships. In particular, geosynchronous Earth orbit (GEO) satellites ensure stable connectivity for terrestrial users due to their stationary characteristics when observed from Earth. This paper introduces a novel approach to model and analyze GEO satellite networks using stochastic geometry. We model the distribution of GEO satellites in the geostationary orbit according to a binomial point process (BPP) and examine satellite visibility depending on the terminal's latitude. Then, we identify potential distribution cases for GEO satellites and derive case probabilities based on the properties of the BPP. We also obtain the distance distributions between the terminal and GEO satellites and derive the coverage probability of the network. We further approximate the derived expressions using the Poisson limit theorem. Monte Carlo simulations are performed to validate the analytical findings, demonstrating a strong alignment between the analyses and simulations. The simplified analytical results can be used to estimate the coverage performance of GEO satellite networks by effectively modeling the positions of GEO satellites.
△ Less
Submitted 26 December, 2023;
originally announced December 2023.
-
Activity Grammars for Temporal Action Segmentation
Authors:
Dayoung Gong,
Joonseok Lee,
Deunsol Jung,
Suha Kwak,
Minsu Cho
Abstract:
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective act…
▽ More
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
Stability-Informed Initialization of Neural Ordinary Differential Equations
Authors:
Theodor Westny,
Arman Mohammadi,
Daniel Jung,
Erik Frisk
Abstract:
This paper addresses the training of Neural Ordinary Differential Equations (neural ODEs), and in particular explores the interplay between numerical integration techniques, stability regions, step size, and initialization techniques. It is shown how the choice of integration technique implicitly regularizes the learned model, and how the solver's corresponding stability region affects training an…
▽ More
This paper addresses the training of Neural Ordinary Differential Equations (neural ODEs), and in particular explores the interplay between numerical integration techniques, stability regions, step size, and initialization techniques. It is shown how the choice of integration technique implicitly regularizes the learned model, and how the solver's corresponding stability region affects training and prediction performance. From this analysis, a stability-informed parameter initialization technique is introduced. The effectiveness of the initialization method is displayed across several learning benchmarks and industrial applications.
△ Less
Submitted 6 August, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection
Authors:
Sangha Park,
Jisoo Mok,
Dahuin Jung,
Saehyung Lee,
Sungroh Yoon
Abstract:
Successful detection of Out-of-Distribution (OoD) data is becoming increasingly important to ensure safe deployment of neural networks. One of the main challenges in OoD detection is that neural networks output overconfident predictions on OoD data, make it difficult to determine OoD-ness of data solely based on their predictions. Outlier exposure addresses this issue by introducing an additional…
▽ More
Successful detection of Out-of-Distribution (OoD) data is becoming increasingly important to ensure safe deployment of neural networks. One of the main challenges in OoD detection is that neural networks output overconfident predictions on OoD data, make it difficult to determine OoD-ness of data solely based on their predictions. Outlier exposure addresses this issue by introducing an additional loss that encourages low-confidence predictions on OoD data during training. While outlier exposure has shown promising potential in improving OoD detection performance, all previous studies on outlier exposure have been limited to utilizing visual outliers. Drawing inspiration from the recent advancements in vision-language pre-training, this paper venture out to the uncharted territory of textual outlier exposure. First, we uncover the benefits of using textual outliers by replacing real or virtual outliers in the image-domain with textual equivalents. Then, we propose various ways of generating preferable textual outliers. Our extensive experiments demonstrate that generated textual outliers achieve competitive performance on large-scale OoD and hard OoD benchmarks. Furthermore, we conduct empirical analyses of textual outliers to provide primary criteria for designing advantageous textual outliers: near-distribution, descriptiveness, and inclusion of visual semantics.
△ Less
Submitted 25 October, 2023;
originally announced October 2023.
-
PUCA: Patch-Unshuffle and Channel Attention for Enhanced Self-Supervised Image Denoising
Authors:
Hyemi Jang,
Junsung Park,
Dahuin Jung,
Jaihyun Lew,
Ho Bae,
Sungroh Yoon
Abstract:
Although supervised image denoising networks have shown remarkable performance on synthesized noisy images, they often fail in practice due to the difference between real and synthesized noise. Since clean-noisy image pairs from the real world are extremely costly to gather, self-supervised learning, which utilizes noisy input itself as a target, has been studied. To prevent a self-supervised deno…
▽ More
Although supervised image denoising networks have shown remarkable performance on synthesized noisy images, they often fail in practice due to the difference between real and synthesized noise. Since clean-noisy image pairs from the real world are extremely costly to gather, self-supervised learning, which utilizes noisy input itself as a target, has been studied. To prevent a self-supervised denoising model from learning identical mapping, each output pixel should not be influenced by its corresponding input pixel; This requirement is known as J-invariance. Blind-spot networks (BSNs) have been a prevalent choice to ensure J-invariance in self-supervised image denoising. However, constructing variations of BSNs by injecting additional operations such as downsampling can expose blinded information, thereby violating J-invariance. Consequently, convolutions designed specifically for BSNs have been allowed only, limiting architectural flexibility. To overcome this limitation, we propose PUCA, a novel J-invariant U-Net architecture, for self-supervised denoising. PUCA leverages patch-unshuffle/shuffle to dramatically expand receptive fields while maintaining J-invariance and dilated attention blocks (DABs) for global context incorporation. Experimental results demonstrate that PUCA achieves state-of-the-art performance, outperforming existing methods in self-supervised image denoising.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
Extract-and-Adaptation Network for 3D Interacting Hand Mesh Recovery
Authors:
JoonKyu Park,
Daniel Sungho Jung,
Gyeongsik Moon,
Kyoung Mu Lee
Abstract:
Understanding how two hands interact with each other is a key component of accurate 3D interacting hand mesh recovery. However, recent Transformer-based methods struggle to learn the interaction between two hands as they directly utilize two hand features as input tokens, which results in distant token problem. The distant token problem represents that input tokens are in heterogeneous spaces, lea…
▽ More
Understanding how two hands interact with each other is a key component of accurate 3D interacting hand mesh recovery. However, recent Transformer-based methods struggle to learn the interaction between two hands as they directly utilize two hand features as input tokens, which results in distant token problem. The distant token problem represents that input tokens are in heterogeneous spaces, leading Transformer to fail in capturing correlation between input tokens. Previous Transformer-based methods suffer from the problem especially when poses of two hands are very different as they project features from a backbone to separate left and right hand-dedicated features. We present EANet, extract-and-adaptation network, with EABlock, the main component of our network. Rather than directly utilizing two hand features as input tokens, our EABlock utilizes two complementary types of novel tokens, SimToken and JoinToken, as input tokens. Our two novel tokens are from a combination of separated two hand features; hence, it is much more robust to the distant token problem. Using the two type of tokens, our EABlock effectively extracts interaction feature and adapts it to each hand. The proposed EANet achieves the state-of-the-art performance on 3D interacting hands benchmarks. The codes are available at https://github.com/jkpark0825/EANet.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.