Search | arXiv e-print repository

Prediction of Frozen Region Growth in Kidney Cryoablation Intervention Using a 3D Flow-Matching Model

Authors: Siyeop Yoon, Yujin Oh, Matthew Tivnan, Sifan Song, Pengfei Jin, Sekeun Kim, Hyun Jin Cho, Dufan Wu, Raul Uppot, Quanzheng Li

Abstract: This study presents a 3D flow-matching model designed to predict the progression of the frozen region (iceball) during kidney cryoablation. Precise intraoperative guidance is critical in cryoablation to ensure complete tumor eradication while preserving adjacent healthy tissue. However, conventional methods, typically based on physics-driven or diffusion-based simulations, are computationally demanding and often struggle to represent complex anatomical structures accurately. To address these limitations, our approach leverages intraoperative CT imaging to inform the model. The proposed 3D flow-matching model is trained to learn a continuous deformation field that maps early-stage CT scans to future predictions. This transformation not only estimates the volumetric expansion of the iceball but also generates corresponding segmentation masks, effectively capturing spatial and morphological changes over time. Quantitative analysis highlights the model's robustness, demonstrating strong agreement between predictions and ground-truth segmentations. The model achieves an Intersection over Union (IoU) score of 0.61 and a Dice coefficient of 0.75. By integrating real-time CT imaging with advanced deep learning techniques, this approach has the potential to enhance intraoperative guidance in kidney cryoablation, improving procedural outcomes and advancing the field of minimally invasive surgery.

Submitted 11 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

Comments: MICCAI 2025 submitted version (author list included)

arXiv:2503.00030 [pdf, other]

Game-Theoretic Regularized Self-Play Alignment of Large Language Models

Authors: Xiaohang Tang, Sangwoong Yoon, Seongho Son, Huizhuo Yuan, Quanquan Gu, Ilija Bogunovic

Abstract: Self-play alignment algorithms have been developed as effective methods for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, the regularization with respect to the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. In this paper, we show that our regularization method can improve the unregularized self-play significantly. To study the impact of different regularizations in self-play alignment, we propose Regularized Self-Play Policy Optimization (RSPO). This generalized framework regularizes the self-play by simply adding a chosen regularization term into the loss while maintaining provable last-iterate convergence to the Nash Equilibrium of the corresponding regularized game. Surprisingly, empirical evaluations using the Mistral-7B-Instruct base model reveal that forward KL divergence regularization reduces response length in RSPO, whereas reverse KL divergence markedly improves raw win rates. RSPO with a linear combination of forward and reverse KL divergence regularization substantially increases the length-controlled win rate in AlpacaEval-2, elevating the unregularized self-play alignment method (SPPO) from 28.53% to 35.44%. Finally, we show that RSPO also improves the response diversity.

Submitted 24 February, 2025; originally announced March 2025.

Comments: Preprint

arXiv:2502.19765 [pdf, ps, other]

EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models

Authors: Che Hyun Lee, Heeseung Kim, Jiheum Yeom, Sungroh Yoon

Abstract: We propose EdiText, a controllable text editing method that modifies the reference text to desired attributes at various scales. We integrate an SDEdit-based editing technique that allows for broad adjustments in the degree of text editing. Additionally, we introduce a novel fine-level editing method based on self-conditioning, which allows subtle control of reference text. While being capable of editing on its own, this fine-grained method, integrated with the SDEdit approach, enables EdiText to make precise adjustments within the desired range. EdiText demonstrates its controllability to robustly adjust reference text at a broad range of levels across various tasks, including toxicity control and sentiment control.

Submitted 2 June, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

Comments: ACL 2025

arXiv:2502.19759 [pdf, other]

Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models

Authors: Heeseung Kim, Che Hyun Lee, Sangkwon Park, Jiheum Yeom, Nohil Park, Sangwon Yu, Sungroh Yoon

Abstract: Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored. To fill this gap, we systematically evaluate how well open-source interaction models utilize past utterances using ContextDialog, a benchmark we proposed for this purpose. Our findings show that speech-based models have more difficulty than text-based ones, especially when recalling information conveyed in speech, and even with retrieval-augmented generation, models still struggle with questions about past utterances. These insights highlight key limitations in open-source models and suggest ways to improve memory retention and retrieval robustness.

Submitted 23 May, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

Comments: ACL 2025 Findings, Project Page: https://contextdialog.github.io/

arXiv:2502.19207 [pdf, other]

FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge

Authors: Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung

Abstract: Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed while retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.

Submitted 26 February, 2025; originally announced February 2025.

Comments: 16 pages

arXiv:2502.18881 [pdf, other]

doi 10.1145/3706598.3714206

Letters from Future Self: Augmenting the Letter-Exchange Exercise with LLM-based Agents to Enhance Young Adults' Career Exploration

Authors: Hayeon Jeon, Suhwoo Yoon, Keyeun Lee, Seo Hyeong Kim, Esther Hehsun Kim, Seonghye Cho, Yena Ko, Soeun Yang, Laura Dabbish, John Zimmerman, Eun-mee Kim, Hajin Lim

Abstract: Young adults often encounter challenges in career exploration. Self-guided interventions, such as the letter-exchange exercise, where participants envision and adopt the perspective of their future selves by exchanging letters with their envisioned future selves, can support career development. However, the broader adoption of such interventions may be limited without structured guidance. To address this, we integrated Large Language Model (LLM)-based agents that simulate participants' future selves into the letter-exchange exercise and evaluated their effectiveness. A one-week experiment (N=36) compared three conditions: (1) participants manually writing replies to themselves from the perspective of their future selves (baseline), (2) future-self agents generating letters to participants, and (3) future-self agents engaging in chat conversations with participants. Results indicated that exchanging letters with future-self agents enhanced participants' engagement during the exercise, while overall benefits of the intervention on future orientation, career self-concept, and psychological support remained comparable across conditions. We discuss design implications for AI-augmented interventions for supporting young adults' career exploration.

Submitted 5 May, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

Comments: 21 pages, 9 figures, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (Best Paper Award, Top 1%)

arXiv:2502.13952 [pdf, other]

Characterization of a TES-based Anti-Coincidence Detector for Future Large Field-of-View X-ray Calorimetry Missions

Authors: Samuel V. Hull, Joseph S. Adams, Simon R. Bandler, Matthew Cherry, James A. Chervenak, Renata Cumbee, Xavier Defay, Enectali Figueroa-Feliciano, Fred M. Finkbeiner, Joshua Fuhrman, Richard L. Kelley, Christopher Kenney, Caroline A. Kilbourne, Noah Kurinsky, Jennette Mateo, Haruka Muramatsu, Frederick S. Porter, Kazuhiro Sakai, Aviv Simchony, Stephen J. Smith, Zoe Smith, Nicholas A. Wakeham, Edward J. Wassell, Sang H. Yoon, Betty A. Young

Abstract: Microcalorimeter instruments aboard future X-ray observatories will require an anti-coincidence (anti-co) detector to veto charged particle events and reduce the non-X-ray background. We have developed a large-format, TES-based prototype anti-coincidence detector that is particularly suitable for use with spatially extended (~10 cm^2) TES microcalorimeter arrays, as would be used for future large field-of-view X-ray missions. This prototype was developed in the context of the Line Emission Mapper (LEM) probe concept, which required a ~14 cm^2 anti-co detector with > 95% live time and a low-energy threshold below 20 keV. Our anti-co design employs parallel networks of quasiparticle-trap-assisted electrothermal feedback TESs (QETs) to detect the athermal phonon signal produced in the detector substrate by incident charged particles. We developed multiple prototype anti-co designs featuring 12 channels and up to 6300 QETs. Here we focus on a design utilizing tungsten TESs and present characterization results. Broad energy range measurements have been performed (4.1 keV to 5.5 MeV). Based on noise and responsivity measurements, the implied low-energy threshold is < 1 keV and a live time fraction of > 96% can be achieved up to 5.5 MeV. We also find evidence of mm-scale-or-better spatial resolution and discuss the potential utility of this for future missions. Finally, we discuss the early development of a solid-state physics model of the anti-co towards understanding phonon propagation and quasiparticle production in the detector.

Submitted 19 February, 2025; originally announced February 2025.

Comments: 26 pages, 16 figures

arXiv:2502.13280 [pdf, other]

Value Gradient Sampler: Sampling as Sequential Decision Making

Authors: Sangwoong Yoon, Himchan Hwang, Hyeokju Jeong, Dong Kyu Shin, Che-Sang Park, Sehee Kweon, Frank Chongwoo Park

Abstract: We propose the Value Gradient Sampler (VGS), a trainable sampler based on the interpretation of sampling as discrete-time sequential decision-making. VGS generates samples from a given unnormalized density (i.e., energy) by drifting and diffusing randomly initialized particles. In VGS, finding the optimal drift is equivalent to solving an optimal control problem where the cost is the upper bound of the KL divergence between the target density and the samples. We employ value-based dynamic programming to solve this optimal control problem, which gives the gradient of the value function as the optimal drift vector. The connection to sequential decision making allows VGS to leverage extensively studied techniques in reinforcement learning, making VGS a fast, adaptive, and accurate sampler that achieves competitive results in various sampling benchmarks. Furthermore, VGS can replace MCMC in contrastive divergence training of energy-based models. We demonstrate the effectiveness of VGS in training accurate energy-based models in industrial anomaly detection applications.

Submitted 1 March, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

Comments: Code: https://github.com/swyoon/value-gradient-sampler/

arXiv:2502.11767 [pdf, ps, other]

From Selection to Generation: A Survey of LLM-based Active Learning

Authors: Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, Ryan Aponte, Hanjia Lyu, Joe Barrow, Hongjie Chen, Franck Dernoncourt, Branislav Kveton, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Sungchul Kim, Zhengmian Hu, Yue Zhao, Nedim Lipka, Seunghyun Yoon, Ting-Hao Kenneth Huang, Zichao Wang, et al. (9 additional authors not shown)

Abstract: Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.

Submitted 31 May, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

Comments: ACL 2025

arXiv:2502.10764 [pdf, other]

Learning to Explain Air Traffic Situation

Authors: Hong-ah Chai, Seokbin Yoon, Keumjin Lee

Abstract: Understanding how air traffic controllers construct a mental 'picture' of complex air traffic situations is crucial but remains a challenge due to the inherently intricate, high-dimensional interactions between aircraft, pilots, and controllers. Previous work on modeling the strategies of air traffic controllers and their mental image of traffic situations often centers on specific air traffic control tasks or pairwise interactions between aircraft, neglecting to capture the comprehensive dynamics of an air traffic situation. To address this issue, we propose a machine learning-based framework for explaining air traffic situations. Specifically, we employ a Transformer-based multi-agent trajectory model that encapsulates both the spatio-temporal movement of aircraft and social interaction between them. By deriving attention scores from the model, we can quantify the influence of individual aircraft on overall traffic dynamics. This provides explainable insights into how air traffic controllers perceive and understand the traffic situation. Trained on real-world air traffic surveillance data collected from the terminal airspace around Incheon International Airport in South Korea, our framework effectively explicates air traffic situations. This could potentially support and enhance the decision-making and situational awareness of air traffic controllers.

Submitted 27 May, 2025; v1 submitted 15 February, 2025; originally announced February 2025.

Comments: 5 pages, 3 figures, minor revisions to address reviewer feedback for final submission to the First US-Europe Air Transportation Research and Development (ATRD) Symposium

arXiv:2502.09793 [pdf, other]

Noise Controlled CT Super-Resolution with Conditional Diffusion Model

Authors: Yuang Wang, Siyeop Yoon, Rui Hu, Baihui Yu, Duhgoon Lee, Rajiv Gupta, Li Zhang, Zhiqiang Chen, Dufan Wu

Abstract: Improving the spatial resolution of CT images is a meaningful yet challenging task, often accompanied by the issue of noise amplification. This article introduces an innovative framework for noise-controlled CT super-resolution utilizing the conditional diffusion model. The model is trained on hybrid datasets, combining noise-matched simulation data with segmented details from real data. Experimental results with real CT images validate the effectiveness of our proposed framework, showing its potential for practical applications in CT imaging.

Submitted 13 February, 2025; originally announced February 2025.

Comments: The 8th International Conference on Image Formation in X-Ray Computed Tomography, Bamberg, Germany, August 5 - 9, 2024

arXiv:2502.08662 [pdf, ps, other]

RoToR: Towards More Reliable Responses for Order-Invariant Inputs

Authors: Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang

Abstract: Mitigating positional bias of language models (LMs) for listwise inputs is a well-known and important problem (e.g., lost-in-the-middle). While zero-shot order-invariant LMs have been proposed to solve this issue, their success on practical listwise problems has been limited. In this work, as a first contribution, we identify and overcome two limitations to make zero-shot invariant LMs more practical: (1) training and inference distribution mismatch arising from modifying positional ID assignments to enforce invariance, and (2) failure to adapt to a mixture of order-invariant and order-sensitive inputs in practical listwise problems. Then, to overcome these issues, we propose (1) RoToR, a zero-shot invariant LM for genuinely order-invariant inputs with minimal modifications of positional IDs, and (2) Selective Routing, an adaptive framework that handles both order-invariant and order-sensitive inputs in listwise tasks. On the Lost in the Middle (LitM), Knowledge Graph QA (KGQA), and MMLU benchmarks, we show that RoToR with Selective Routing can effectively handle practical listwise input tasks in a zero-shot manner (https://github.com/soyoung97/RoToR).

Submitted 2 June, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

Comments: Accepted at ACL 2025 main

arXiv:2502.06802 [pdf, other]

Solving the Content Gap in Roblox Game Recommendations: LLM-Based Profile Generation and Reranking

Authors: Chen Wang, Xiaokai Wei, Yexi Jiang, Frank Ong, Kevin Gao, Xiao Yu, Zheng Hui, Se-eun Yoon, Philip Yu, Michelle Gong

Abstract: With the vast and dynamic user-generated content on Roblox, creating effective game recommendations requires a deep understanding of game content. Traditional recommendation models struggle with the inconsistent and sparse nature of game text features such as titles and descriptions. Recent advancements in large language models (LLMs) offer opportunities to enhance recommendation systems by analyzing in-game text data. This paper addresses two challenges: generating high-quality, structured text features for games without extensive human annotation, and validating these features to ensure they improve recommendation relevance. We propose an approach that extracts in-game text and uses LLMs to infer attributes such as genre and gameplay objectives from raw player interactions. Additionally, we introduce an LLM-based re-ranking mechanism to assess the effectiveness of the generated text features, enhancing personalization and user satisfaction. Beyond recommendations, our approach supports applications such as user engagement-based integrity detection, already deployed in production. This scalable framework demonstrates the potential of in-game text understanding to improve recommendation quality on Roblox and adapt recommendations to its unique, user-generated ecosystem.

Submitted 1 February, 2025; originally announced February 2025.

arXiv:2502.05167 [pdf, other]

NoLiMa: Long-Context Evaluation Beyond Literal Matching

Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze

Abstract: Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.

Submitted 26 March, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.05055 [pdf]

Differentiable Mobile Display Photometric Stereo

Authors: Gawoon Ban, Hyeongjun Kim, Seokjun Choi, Seungwoo Yoon, Seung-Hwan Baek

Abstract: Display photometric stereo uses a display as a programmable light source to illuminate a scene with diverse illumination conditions. Recently, differentiable display photometric stereo (DDPS) demonstrated improved normal reconstruction accuracy by using learned display patterns. However, DDPS faced limitations in practicality, requiring a fixed desktop imaging setup using a polarization camera and a desktop-scale monitor. In this paper, we propose a more practical physics-based photometric stereo, differentiable mobile display photometric stereo (DMDPS), that leverages a mobile phone consisting of a display and a camera. We overcome the limitations of using a mobile device by developing a mobile app and method that simultaneously displays patterns and captures high-quality HDR images. Using this technique, we capture real-world 3D-printed objects and learn display patterns via a differentiable learning process. We demonstrate the effectiveness of DMDPS on both a 3D-printed dataset and a first dataset of fallen leaves. The leaf dataset contains reconstructed surface normals and albedos of fallen leaves that may enable future research beyond computer graphics and vision. We believe that DMDPS takes a step forward for practical physics-based photometric stereo.

Submitted 7 February, 2025; originally announced February 2025.

Comments: 9 pages

arXiv:2502.01419 [pdf, other]

Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models

Authors: Mingi Jung, Saehuyng Lee, Eunji Kim, Sungroh Yoon

Abstract: Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.

Submitted 3 February, 2025; originally announced February 2025.

ACM Class: I.2.7

arXiv:2502.01059 [pdf, other]

Knowledge Synthesis of Photosynthesis Research Using a Large Language Model

Authors: Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, Tae In Ahn

Abstract: The development of biological data analysis tools and large language models (LLMs) has opened up new possibilities for utilizing AI in plant science research, with the potential to contribute significantly to knowledge integration and research gap identification. Nonetheless, current LLMs struggle to handle complex biological data and theoretical models in photosynthesis research and often fail to provide accurate scientific contexts. Therefore, this study proposed a photosynthesis research assistant (PRAG) based on OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt optimization. Vector databases and an automated feedback loop were used in the prompt optimization process to enhance the accuracy and relevance of the responses to photosynthesis-related queries. PRAG showed an average improvement of 8.7% across five metrics related to scientific writing, with a 25.4% increase in source transparency. Additionally, its scientific depth and domain coverage were comparable to those of photosynthesis research papers. A knowledge graph was used to structure PRAG's responses with papers within and outside the database, which allowed PRAG to match key entities with 63% and 39.5% of the database and test papers, respectively. PRAG can be applied for photosynthesis research and broader plant science domains, paving the way for more in-depth data analysis and predictive capabilities.

Submitted 3 February, 2025; originally announced February 2025.

Comments: 17 pages, 6 figures

arXiv:2502.00654 [pdf, other]

EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis

Authors: Junuk Cha, Seongro Yoon, Valeriya Strizhkova, Francois Bremond, Seungryul Baek

Abstract: 3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images with real-time inference speed. However, since it is typically trained on only a short video that lacks diversity in facial emotions, the resultant talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. It is able to manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal), while retaining synchronization of lip movements with input audio. Additionally, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than state-of-the-art methods in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C).

Submitted 1 February, 2025; originally announced February 2025.

Comments: 22 pages

arXiv:2502.00619 [pdf, other]

Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective

Authors: Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li

Abstract: Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code is available at https://github.com/tvseg/dMoE.

Submitted 27 May, 2025; v1 submitted 1 February, 2025; originally announced February 2025.

Comments: ICML 2025 spotlight, see https://openreview.net/forum?id=BUONdewsBa

arXiv:2501.17797 [pdf, other]

doi 10.3847/1538-4357/acc86b

The SPHEREx Target List of Ice Sources (SPLICES)

Authors: Matthew L. N. Ashby, Joseph L. Hora, Kiran Lakshmipathaiah, Sarita Vig, Rama Krishna Sai Subrahmanyam Gorthi, Miju Kang, Volker Tolls, Gary J. Melnick, Michael W. Werner, Brendan P. Crill, Daniel C. Masters, Carlos Contreras Pena, Jeong-Eun Lee, Jaeyeong Kim, Ho-Gyu Lee, Sung-Yong Yoon, Soung-Chul Yang, Nicholas Flagey, Bertrand Mennesson

Abstract: One of the primary objectives of the SPHEREx mission is to understand the origin of molecules such as H2O, CO2, and other volatile compounds at the early stages of planetary system formation. Because the vast majority of these compounds -- typically exceeding 95% -- exist in the solid phase rather than the gaseous phase in the systems of concern here, the observing strategy planned to characterize them is slightly unusual. Specifically, SPHEREx will target highly obscured sources throughout the Milky Way, and observe the species of concern in absorption against background illumination. SPHEREx spectrophotometry will yield ice column density measurements for millions of obscured Milky Way sources of all ages and types. By correlating those column densities with source ages, the SPHEREx mission will shed light on whether those molecules were formed in situ along with their nascent stellar systems, or whether instead they formed elsewhere and were introduced into those systems after their formation. To that end, this work describes version 7.1 of the SPHEREx Target List of Ice Sources (SPLICES) for the community. It contains about 8.6 million objects brighter than W2~12 Vega mag over much of the sky, principally within a broad strip running the length of the Milky Way midplane, but also within high-latitude molecular clouds and even the Magellanic Clouds.

Submitted 29 January, 2025; originally announced January 2025.

Comments: Published by ApJ. 21 pages, 6 figures. This article documents the original version of SPLICES (7.1). The current version as well as the complete catalog is publicly available along with release notes documenting all additions and changes at the NASA/IPAC Infrared Science Archive (IRSA) at this URL: https://irsa.ipac.caltech.edu/data/SPHEREx/SPLICES/

Journal ref: ApJ, 949, 105 (2023)

arXiv:2501.11225 [pdf, other]

CNN-based TEM image denoising from first principles

Authors: Jinwoong Chae, Sungwook Hong, Sungkyu Kim, Sungroh Yoon, Gunn Kim

Abstract: Transmission electron microscope (TEM) images are often corrupted by noise, hindering their interpretation. To address this issue, we propose a deep learning-based approach using simulated images. Using density functional theory calculations with a set of pseudo-atomic orbital basis sets, we generate highly accurate ground truth images. We introduce four types of noise into these simulations to create realistic training datasets. Each type of noise is then used to train a separate convolutional neural network (CNN) model. Our results show that these CNNs are effective in reducing noise, even when applied to images with different noise levels than those used during training. However, we observe limitations in some cases, particularly in preserving the integrity of circular shapes and avoiding visible artifacts between image patches. To overcome these challenges, we propose alternative training strategies and future research directions. This study provides a valuable framework for training deep learning models for TEM image denoising.

Submitted 19 January, 2025; originally announced January 2025.

Comments: 10 pages and 4 figures

arXiv:2501.10913 [pdf, other]

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Authors: Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, Sungroh Yoon

Abstract: While CLIP has significantly advanced multimodal understanding by bridging vision and language, its inability to grasp negation - such as failing to differentiate concepts like "parking" from "no parking" - poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg, a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.

Submitted 31 March, 2025; v1 submitted 18 January, 2025; originally announced January 2025.

arXiv:2501.04970 [pdf, other]

Battling the Non-stationarity in Time Series Forecasting via Test-time Adaptation

Authors: HyunGi Kim, Siwon Kim, Jisoo Mok, Sungroh Yoon

Abstract: Deep Neural Networks have spearheaded remarkable advancements in time series forecasting (TSF), one of the major tasks in time series modeling. Nonetheless, the non-stationarity of time series undermines the reliability of pre-trained source time series forecasters in mission-critical deployment settings. In this study, we introduce a pioneering test-time adaptation framework tailored for TSF (TSF-TTA). TAFAS, the proposed approach to TSF-TTA, flexibly adapts source forecasters to continuously shifting test distributions while preserving the core semantic information learned during pre-training. The novel utilization of partially-observed ground truth and gated calibration module enables proactive, robust, and model-agnostic adaptation of source forecasters. Experiments on diverse benchmark datasets and cutting-edge architectures demonstrate the efficacy and generality of TAFAS, especially in long-term forecasting scenarios that suffer from significant distribution shifts. The code is available at https://github.com/kimanki/TAFAS.

Submitted 8 January, 2025; originally announced January 2025.

Comments: Accepted at AAAI 2025

arXiv:2501.01409 [pdf, other]

JOG3R: Towards 3D-Consistent Video Generators

Authors: Chun-Hao Paul Huang, Niloy Mitra, Hyeonho Jeong, Jae Shin Yoon, Duygu Ceylan

Abstract: Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator - OpenSora in our case - can support camera pose estimation. Surprisingly, at first, we only find a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. Instead, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality while producing 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.

Submitted 26 March, 2025; v1 submitted 2 January, 2025; originally announced January 2025.

arXiv:2412.18711 [pdf, other]

Measurement of reactor antineutrino oscillation amplitude and frequency using 3800 days of complete data sample of the RENO experiment

Authors: S. Jeon, H. I. Kim, J. H. Choi, H. I. Jang, J. S. Jang, K. K. Joo, D. E. Jung, J. G. Kim, J. H. Kim, J. Y. Kim, S. B. Kim, S. Y. Kim, W. Kim, E. Kwon, D. H. Lee, H. G. Lee, W. J. Lee, I. T. Lim, D. H. Moon, M. Y. Pac, J. S. Park, R. G. Park, H. Seo, J. W. Seo, C. D. Shin , et al. (5 additional authors not shown)

Abstract: We report an updated neutrino mixing angle of $\theta_{13}$ obtained from a complete data sample of the RENO experiment. The experiment has measured the amplitude and frequency of reactor anti-electron-neutrino ($\bar{\nu}_{e}$) oscillations at the Hanbit nuclear power plant, Younggwang, Korea, since August 2011. As of March 2023, the data acquisition was completed after a total of 3800 live days of detector operation. The observed candidates via inverse beta decay (IBD) are 1,211,995 (144,667) in the near (far) detector. Based on an observed energy-dependent reactor neutrino disappearance, neutrino oscillation parameters of $\theta_{13}$ and $\lvert\Delta m_{ee}^2\rvert$ are precisely determined as $\sin^{2}2\theta_{13}=0.0920_{-0.0042}^{+0.0044}(\text{stat.})_{-0.0041}^{+0.0041}(\text{syst.})$ and $\lvert\Delta m_{ee}^2\rvert=\left[2.57_{-0.11}^{+0.10}(\text{stat.})_{-0.05}^{+0.05}(\text{syst.})\right]\times10^{-3}~\text{eV}^{2}$. Compared to the previous RENO results published in Phys. Rev. Lett. 121, 201801, the precision is improved from 7.5% to 6.4% for $\sin^{2}2\theta_{13}$ and from 5.2% to 4.5% for $\lvert\Delta m_{ee}^2\rvert$. The statistical error of the measurement has reached our goal and is hardly improved with additional data-taking.

Submitted 24 December, 2024; originally announced December 2024.

Comments: 13 pages, 11 figures
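
Note on the quoted precision in the RENO entry above: the abstract does not say how the 6.4% and 4.5% figures were obtained. The sketch below is one plausible reading, not the collaboration's stated procedure: it assumes each asymmetric error is symmetrized by averaging its upper and lower values, and that statistical and systematic errors are added in quadrature. The function name `relative_precision` is hypothetical.

```python
import math


def relative_precision(value, stat_lo, stat_hi, syst_lo, syst_hi):
    """Return the total relative uncertainty in percent.

    Assumptions (not stated in the abstract): asymmetric errors are
    symmetrized by simple averaging, and the statistical and systematic
    parts are combined in quadrature.
    """
    stat = 0.5 * (stat_lo + stat_hi)
    syst = 0.5 * (syst_lo + syst_hi)
    total = math.hypot(stat, syst)  # sqrt(stat**2 + syst**2)
    return 100.0 * total / value


# Central values and errors copied from the abstract.
sin2_2theta13 = relative_precision(0.0920, 0.0042, 0.0044, 0.0041, 0.0041)
# The common factor of 1e-3 eV^2 cancels in the ratio, so it is omitted.
dm2_ee = relative_precision(2.57, 0.11, 0.10, 0.05, 0.05)

print(f"sin^2(2 theta_13): {sin2_2theta13:.2f}%")
print(f"|Delta m^2_ee|:    {dm2_ee:.2f}%")
```

Under these assumptions both results land within about a tenth of a percentage point of the quoted 6.4% and 4.5%, which suggests the abstract's precision figures describe the combined statistical and systematic uncertainty.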

arXiv:2412.15484 [pdf, ps, other]

Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

Authors: Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon

Abstract: Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics and that existing approaches to improve the MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.

Submitted 29 May, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

Comments: ICML 2025

arXiv:2412.14568 [pdf, other]

Improving Geometry in Sparse-View 3DGS via Reprojection-based DoF Separation

Authors: Yongsung Kim, Minjun Park, Jooyoung Choi, Sungroh Yoon

Abstract: Recent learning-based Multi-View Stereo models have demonstrated state-of-the-art performance in sparse-view 3D reconstruction. However, directly applying 3D Gaussian Splatting (3DGS) as a refinement step following these models presents challenges. We hypothesize that the excessive positional degrees of freedom (DoFs) in Gaussians induce geometry distortion, fitting color patterns at the cost of structural fidelity. To address this, we propose reprojection-based DoF separation, a method distinguishing positional DoFs in terms of uncertainty: image-plane-parallel DoFs and ray-aligned DoF. To independently manage each DoF, we introduce a reprojection process along with tailored constraints for each DoF. Through experiments across various datasets, we confirm that separating the positional DoFs of Gaussians and applying targeted constraints effectively suppresses geometric artifacts, producing reconstruction results that are both visually and geometrically plausible.

Submitted 19 December, 2024; originally announced December 2024.

Comments: 11 pages

arXiv:2412.13875 [pdf, other]

Denoising Nearest Neighbor Graph via Continuous CRF for Visual Re-ranking without Fine-tuning

Authors: Jaeyoon Kim, Yoonki Cho, Taeyong Kim, Sung-Eui Yoon

Abstract: Visual re-ranking using a Nearest Neighbor graph (NN graph) has been adopted to yield high retrieval accuracy, since it is beneficial for exploring a high-dimensional manifold and is applicable without additional fine-tuning. The quality of visual re-ranking using an NN graph, however, is limited by the quality of its connectivity, i.e., the edges of the NN graph. Some edges can be misconnected with negative images. This is known as the noisy edge problem, which degrades retrieval quality. To address this, we propose a complementary denoising method based on Continuous Conditional Random Field (C-CRF) that uses a statistical distance of our similarity-based distribution. This method employs the concept of cliques to make the process computationally feasible. We demonstrate the complementarity of our method through its application to three visual re-ranking methods, observing quality boosts in landmark retrieval and person re-identification (re-ID).

Submitted 18 December, 2024; originally announced December 2024.

arXiv:2412.13734 [pdf, other]

Text2Relight: Creative Portrait Relighting with Text Guidance

Authors: Junuk Cha, Mengwei Ren, Krishna Kumar Singh, He Zhang, Yannick Hold-Geoffroy, Seunghyun Yoon, HyunJoon Jung, Jae Shin Yoon, Seungryul Baek

Abstract: We present a lighting-aware image editing pipeline that, given a portrait image and a text prompt, performs single image relighting. Our model modifies the lighting and color of both the foreground and background to align with the provided text description. The unbounded creativity of text allows us to describe the lighting of a scene with any sensory features including temperature, emotion, smell, time, and so on. However, modeling such a mapping between unbounded text and lighting is extremely challenging due to the lack of scalable datasets that provide large numbers of paired text and relighting examples, and therefore, current text-driven image editing models do not generalize to lighting-specific use cases. We overcome this problem by introducing a novel data synthesis pipeline: First, diverse and creative text prompts that describe the scenes with various lighting are automatically generated under a crafted hierarchy using a large language model (e.g., ChatGPT). A text-guided image generation model creates a lighting image that best matches the text. As a condition of the lighting images, we perform image-based relighting for both foreground and background using a single portrait image or a set of OLAT (One-Light-at-A-Time) images captured with a light stage system. Particularly for the background relighting, we represent the lighting image as a set of point lights and transfer them to other background images. A generative diffusion model learns the synthesized large-scale data with auxiliary task augmentation (e.g., portrait delighting and light positioning) to correlate the latent text and lighting distribution for text-guided portrait relighting.

Submitted 18 December, 2024; originally announced December 2024.

arXiv:2412.13646 [pdf, other]

Transmit What You Need: Task-Adaptive Semantic Communications for Visual Information

Authors: Jeonghun Park, Sung Whan Yoon

Abstract: Recently, semantic communications have drawn great attention as a groundbreaking concept that surpasses the capacity limits of Shannon's theory. Specifically, semantic communications are likely to become crucial in realizing visual tasks that demand massive network traffic. Although highly distinctive forms of visual semantics exist for computer vision tasks, a thorough investigation of what visual semantics can be transmitted in time and which one is required for completing different visual tasks has not yet been reported. To this end, we first scrutinize the achievable throughput in transmitting existing visual semantics through the limited wireless communication bandwidth. In addition, we further demonstrate the resulting performance of various visual tasks for each visual semantic. Based on the empirical testing, we suggest that a task-adaptive selection of visual semantics is crucial for real-time semantic communications for visual tasks, where we transmit basic semantics (e.g., objects in the given image) for simple visual tasks, such as classification, and richer semantics (e.g., scene graphs) for complex tasks, such as image regeneration. To further improve transmission efficiency, we suggest a filtering method for scene graphs, which drops redundant information in the scene graph, thus allowing the sending of essential semantics for completing the given task. We confirm the efficacy of our task-adaptive semantic communication approach through extensive simulations in wireless channels, showing more than 45 times larger throughput over a naive transmission of original data. Our work can be reproduced with the source code at https://github.com/jhpark2024/jhpark.github.io

Submitted 18 December, 2024; originally announced December 2024.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2412.13501 [pdf, other]

GUI Agents: A Survey

Authors: Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen , et al. (4 additional authors not shown)

Abstract: Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.

Submitted 17 December, 2024; originally announced December 2024.

arXiv:2412.13422 [pdf, other]

Generating Diverse Hypotheses for Inductive Reasoning

Authors: Kang-il Lee, Hyukhun Koh, Dongryeol Lee, Seunghyun Yoon, Minsung Kim, Kyomin Jung

Abstract: Inductive reasoning - the process of inferring general rules from a small number of observations - is a fundamental aspect of human intelligence. Recent works suggest that large language models (LLMs) can engage in inductive reasoning by sampling multiple hypotheses about the rules and selecting the one that best explains the observations. However, due to the IID sampling, semantically redundant hypotheses are frequently generated, leading to significant wastage of compute. In this paper, we 1) demonstrate that increasing the temperature to enhance diversity is of limited benefit due to the text degeneration issue, and 2) propose a novel method to improve diversity while maintaining text quality. We first analyze the effect of increasing the temperature parameter, which is regarded as the LLM's diversity control, on IID hypotheses. Our analysis shows that as temperature rises, diversity and accuracy of hypotheses increase up to a certain point, but this trend saturates due to text degeneration. To generate hypotheses that are more semantically diverse and of higher quality, we propose a novel approach inspired by human inductive reasoning, which we call Mixture of Concepts (MoC). When applied to several inductive reasoning benchmarks, MoC demonstrated significant performance improvements compared to standard IID sampling and other approaches.

Submitted 8 February, 2025; v1 submitted 17 December, 2024; originally announced December 2024.

Comments: NAACL 2025

arXiv:2412.10436 [pdf, other]

Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation

Authors: SeungBum Ha, Taehwan Lee, Jiyoun Lim, Sung Whan Yoon

Abstract: Federated learning (FL) has recently garnered attention as a data-decentralized training framework that enables the learning of deep models from locally distributed samples while preserving data privacy. Built upon the framework, immense efforts have been made to establish FL benchmarks, which provide rigorous evaluation settings that control data heterogeneity across clients. Prior efforts have mainly focused on handling relatively simple classification tasks, where each sample is annotated with a one-hot label, such as MNIST, CIFAR, LEAF benchmark, etc. However, little attention has been paid to demonstrating an FL benchmark that handles complicated semantics, where each sample encompasses diverse semantic information from multiple labels, such as Panoptic Scene Graph Generation (PSG) with objects, subjects, and relations between them. Because the existing benchmark is designed to distribute data in a narrow view of a single semantic, e.g., a one-hot label, managing the complicated semantic heterogeneity across clients when formalizing FL benchmarks is non-trivial. In this paper, we propose a benchmark process to establish an FL benchmark with controllable semantic heterogeneity across clients: two key steps are i) data clustering with semantics and ii) data distributing via controllable semantic heterogeneity across clients. As a proof of concept, we first construct a federated PSG benchmark, demonstrating the efficacy of the existing PSG methods in an FL setting with controllable semantic heterogeneity of scene graphs. We also present the effectiveness of our benchmark by applying federated learning algorithms that are robust to data heterogeneity and showing increased performance. Our code is available at https://github.com/Seung-B/FL-PSG.

Submitted 11 December, 2024; originally announced December 2024.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2412.09921 [pdf, other]

FaceShield: Defending Facial Image against Deepfake Threats

Authors: Jaehwan Jeong, Sumin In, Sieun Kim, Hannie Shin, Jongheon Jeong, Sang Ho Yoon, Jaewook Chung, Sangpil Kim

Abstract: The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity is disregarded. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named FaceShield, which introduces novel defense strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates defenses on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG compression. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting transferability to GANs and showcasing greater imperceptibility of noise along with enhanced robustness.

Submitted 10 March, 2025; v1 submitted 13 December, 2024; originally announced December 2024.

arXiv:2412.04680 [pdf, other]

Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Authors: Jaihyun Lew, Soohyuk Jang, Jaehoon Lee, Seungryong Yoo, Eunji Kim, Saehyung Lee, Jisoo Mok, Siwon Kim, Sungroh Yoon

Abstract: Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization pipeline, comprised of pre-aggregate extraction and superpixel-aware aggregation, overcomes the challenges that arise in superpixel tokenization. Extensive experiments demonstrate that our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.

Submitted 24 March, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

Comments: Project page: https://github.com/jangsoohyuk/SuiT

arXiv:2412.03736 [pdf, other]

Domain-specific Question Answering with Hybrid Search

Authors: Dewang Sultania, Zhaoyu Lu, Twisha Naik, Franck Dernoncourt, David Seunghyun Yoon, Sanat Sharma, Trung Bui, Ashok Gupta, Tushar Vatsa, Suhas Suresha, Ishita Verma, Vibha Belavadi, Cheng Chen, Michael Friedrich

Abstract: Domain specific question answering is an evolving field that requires specialized solutions to address unique challenges. In this paper, we show that a hybrid approach combining a fine-tuned dense retriever with keyword based sparse search methods significantly enhances performance. Our system leverages a linear combination of relevance signals, including cosine similarity from dense retrieval, BM25 scores, and URL host matching, each with tunable boost parameters. Experimental results indicate that this hybrid method outperforms our single-retriever system, achieving improved accuracy while maintaining robust contextual grounding. These findings suggest that integrating multiple retrieval methodologies with weighted scoring effectively addresses the complexities of domain specific question answering in enterprise settings.

Submitted 21 December, 2024; v1 submitted 4 December, 2024; originally announced December 2024.

Comments: AAAI-25 Workshop on Document Understanding and Intelligence

arXiv:2412.01756 [pdf, other]

Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios

Authors: Sangyeon Yoon, Wonje Jeung, Albert No

Abstract: Auditing Differentially Private Stochastic Gradient Descent (DP-SGD) in the final model setting is challenging and often results in empirical lower bounds that are significantly looser than theoretical privacy guarantees. We introduce a novel auditing method that achieves tighter empirical lower bounds without additional assumptions by crafting worst-case adversarial samples through loss-based input-space auditing. Our approach surpasses traditional canary-based heuristics and is effective in final model-only scenarios. Specifically, with a theoretical privacy budget of $\varepsilon = 10.0$, our method achieves empirical lower bounds of $4.914$, compared to the baseline of $4.385$ for MNIST. Our work offers a practical framework for reliable and accurate privacy auditing in differentially private machine learning.

Submitted 24 February, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

Comments: 10 pages, NeurIPS (SFLLM Workshop)

arXiv:2412.01140 [pdf, other]

Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes

Authors: Suhyun Shin, Seungwoo Yoon, Ryota Maeda, Seung-Hwan Baek

Abstract: Hyperspectral 3D imaging captures both depth maps and hyperspectral images, enabling comprehensive geometric and material analysis. Recent methods achieve high spectral and depth accuracy; however, they require long acquisition times, often over several minutes, or rely on large, expensive systems, restricting their use to static scenes. We present Dense Dispersed Structured Light (DDSL), an accurate hyperspectral 3D imaging method for dynamic scenes that utilizes stereo RGB cameras and an RGB projector equipped with an affordable diffraction grating film. We design spectrally multiplexed DDSL patterns that significantly reduce the number of required projector patterns, thereby accelerating acquisition speed. Additionally, we formulate an image formation model and a reconstruction method to estimate a hyperspectral image and depth map from captured stereo images. As the first practical and accurate hyperspectral 3D imaging method for dynamic scenes, we experimentally demonstrate that DDSL achieves a spectral resolution of 15.5 nm full width at half maximum (FWHM), a depth error of 4 mm, and a frame rate of 6.6 fps.

Submitted 2 December, 2024; originally announced December 2024.

arXiv:2411.19352 [pdf, other]

OMuleT: Orchestrating Multiple Tools for Practicable Conversational Recommendation

Authors: Se-eun Yoon, Xiaokai Wei, Yexi Jiang, Rachit Pareek, Frank Ong, Kevin Gao, Julian McAuley, Michelle Gong

Abstract: In this paper, we present a systematic effort to design, evaluate, and implement a realistic conversational recommender system (CRS). The objective of our system is to allow users to input free-form text to request recommendations, and then receive a list of relevant and diverse items. While previous work on synthetic queries augments large language models (LLMs) with 1-3 tools, we argue that a more extensive toolbox is necessary to effectively handle real user requests. As such, we propose a novel approach that equips LLMs with over 10 tools, providing them access to the internal knowledge base and API calls used in production. We evaluate our model on a dataset of real users and show that it generates relevant, novel, and diverse recommendations compared to vanilla LLMs. Furthermore, we conduct ablation studies to demonstrate the effectiveness of using the full range of tools in our toolbox. We share our designs and lessons learned from deploying the system for internal alpha release. Our contribution is to address all four key aspects of a practicable CRS: (1) real user requests, (2) augmenting LLMs with a wide variety of tools, (3) extensive evaluation, and (4) deployment insights.

Submitted 31 December, 2024; v1 submitted 28 November, 2024; originally announced November 2024.

arXiv:2411.18040 [pdf, other]

A New Rarity Assessment of the `Disk of Satellites': the Milky Way System Is the Exception Rather than the Rule in the $Λ$CDM Cosmology

Authors: Chanoul Seo, Suk-Jin Yoon, Sanjaya Paudel, Sung-Ho An, Jun-Sung Moon

Abstract: The majority of satellite galaxies around the Milky Way (MW) show disk-like distributions (the disk of satellites; DoS), which is a small-scale problem of the $Λ$CDM cosmology. The conventional definition of the MW-like DoS is a satellite system with a minor-to-major axis ratio ($c$/$a$) lower than the MW's $c$/$a$ value of 0.181. Here we question the validity of the $c$/$a$-based DoS rarity assessment and propose an alternative approach. How satellites are placed around a galaxy is dictated mainly by two factors: the distributions of satellites' orbital poles and distances from the host. Based on this premise, we construct the `satellite distribution generator' code and generate 10$^5$ `spatially and kinematically analogous systems (SKASs)' sharing these two factors. The SKAS can disclose the intrinsic, underlying $c$/$a$ probability distribution function (PDF), from which a present-day $c$/$a$ value is fortuitously determined. We find that the $c$/$a$ PDF of the MW DoS defined by 11 classical satellites is quite broad ($σ_{c/a}$$\sim$0.105), implying that a simple present-day $c$/$a$ value, combined with its highly time-variable nature, cannot fully represent the degree of flatness. Moreover, based on the intrinsic $c$/$a$ PDF, we re-evaluate the rarity of the MW DoS by comparing it with IllustrisTNG50-1 host-satellite systems and find that even with the new measure, the MW DoS remains rare (0.00$\sim$3.40%). We show that the reason behind the rareness is that both orbital poles and distances of the 11 MW satellites are far more plane-friendly than those of simulated host-satellite systems, challenging the current structure and galaxy formation model.

Submitted 26 November, 2024; originally announced November 2024.

Comments: 23 pages, 15 figures

arXiv:2411.15466 [pdf, other]

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

Authors: Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon

Abstract: Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/

Submitted 23 November, 2024; originally announced November 2024.

arXiv:2411.14793 [pdf, other]

Style-Friendly SNR Sampler for Style-Driven Generation

Authors: Jooyoung Choi, Chaehun Shin, Yeongtak Oh, Heeseung Kim, Jungbeom Lee, Sungroh Yoon

Abstract: Recent text-to-image diffusion models generate high-quality images but struggle to learn new, personalized styles, which limits the creation of unique style templates. In style-driven generation, users typically supply reference images exemplifying the desired style, together with text prompts that specify desired stylistic attributes. Previous approaches commonly rely on fine-tuning, yet they often blindly reuse objectives and noise level distributions from pre-training without adaptation. We discover that stylistic features predominantly emerge at higher noise levels, leading current fine-tuning methods to exhibit suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enhances models' ability to capture novel styles indicated by reference images and text prompts. We demonstrate improved generation of novel styles that cannot be adequately described solely with a text prompt, enabling the creation of new style templates for personalized content creation.

Submitted 20 March, 2025; v1 submitted 22 November, 2024; originally announced November 2024.

Comments: Project page: https://stylefriendly.github.io/

arXiv:2411.13036 [pdf, other]

Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization

Authors: Sanghyeob Song, Jaihyun Lew, Hyemi Jang, Sungroh Yoon

Abstract: Estimating the homography between two images is crucial for mid- or high-level vision tasks, such as image stitching and fusion. However, using supervised learning methods is often challenging or costly due to the difficulty of collecting ground-truth data. In response, unsupervised learning approaches have emerged. Most early methods, though, assume that the given image pairs are from the same camera or have minor lighting differences. Consequently, while these methods perform effectively under such conditions, they generally fail when input image pairs come from different domains, referred to as multimodal image pairs. To address these limitations, we propose AltO, an unsupervised learning framework for estimating homography in multimodal image pairs. Our method employs a two-phase alternating optimization framework, similar to Expectation-Maximization (EM), where one phase reduces the geometry gap and the other addresses the modality gap. To handle these gaps, we use Barlow Twins loss for the modality gap and propose an extended version, Geometry Barlow Twins, for the geometry gap. As a result, we demonstrate that our method, AltO, can be trained on multimodal datasets without any ground-truth data. It not only outperforms other unsupervised methods but is also compatible with various architectures of homography estimators. The source code can be found at: https://github.com/songsang7/AltO

Submitted 19 November, 2024; originally announced November 2024.

Comments: This paper is accepted to the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

arXiv:2411.11471 [pdf, other]

Generalizable Person Re-identification via Balancing Alignment and Uniformity

Authors: Yoonki Cho, Jaeyoon Kim, Woo Jae Kim, Junsik Jung, Sung-eui Yoon

Abstract: Domain generalizable person re-identification (DG re-ID) aims to learn discriminative representations that are robust to distributional shifts. While data augmentation is a straightforward solution to improve generalization, certain augmentations exhibit a polarized effect in this task, enhancing in-distribution performance while deteriorating out-of-distribution performance. In this paper, we investigate this phenomenon and reveal that it leads to sparse representation spaces with reduced uniformity. To address this issue, we propose a novel framework, Balancing Alignment and Uniformity (BAU), which effectively mitigates this effect by maintaining a balance between alignment and uniformity. Specifically, BAU incorporates alignment and uniformity losses applied to both original and augmented images and integrates a weighting strategy to assess the reliability of augmented samples, further improving the alignment loss. Additionally, we introduce a domain-specific uniformity loss that promotes uniformity within each source domain, thereby enhancing the learning of domain-invariant features. Extensive experimental results demonstrate that BAU effectively exploits the advantages of data augmentation, which previous studies could not fully utilize, and achieves state-of-the-art performance without requiring complex training procedures. The code is available at https://github.com/yoonkicho/BAU.

Submitted 18 November, 2024; originally announced November 2024.

Comments: NeurIPS 2024

arXiv:2411.10045 [pdf, other]

doi 10.3847/2041-8213/ad8f3c

Discovery of a Rare Group of Dwarf Galaxies in the Local Universe

Authors: Sanjaya Paudel, Cristiano G. Sabiu, Suk-Jin Yoon, Pierre-Alain Duc, Jaewon Yoo, Oliver Müller

Abstract: We report the discovery of a rare isolated group of five dwarf galaxies located at z = 0.0086 ($D$ = 36 Mpc). All member galaxies are star-forming, blue, and gas-rich with $g-r$ indices ranging from 0.2 to 0.6 mag, and two of them show signs of ongoing mutual interaction. The most massive member of the group has a stellar mass that is half of the Small Magellanic Cloud stellar mass, and the median stellar mass of the group members is 7.87 $\times$ 10$^{7}$ M$_{\odot}$. The derived total dynamical mass of the group is $M_{\rm dyn}$ = 6.02$\times$10$^{10}$ M$_{\odot}$, whereas its total baryonic mass (stellar + HI) is 2.6$\times$10$^{9}$ M$_{\odot}$, which gives us the dynamical to baryonic mass ratio of 23. Interestingly, all galaxies found in the group are aligned along a straight line in the plane of the sky. The observed spatial extent of the member galaxies is 154 kpc, and their relative line-of-sight velocity span is within 75 km s$^{-1}$. Using the spatially resolved optical spectra provided by DESI EDR, we find that three group members share a common rotational direction. With these unique properties of the group and its member galaxies, we discuss the possible importance of such a system in the formation and evolution of dwarf galaxy groups and in testing the theory of large-scale structure formation.

Submitted 15 November, 2024; originally announced November 2024.

Comments: Accepted for publication in ApJL

arXiv:2411.09944 [pdf, other]

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

Authors: Thang M. Pham, Phat T. Nguyen, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Trung Bui

Abstract: While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering and suggestion tasks. Our smallest model demonstrates efficient performance on S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.

Submitted 25 November, 2024; v1 submitted 14 November, 2024; originally announced November 2024.

arXiv:2411.08378 [pdf, other]

Physics Informed Distillation for Diffusion Models

Authors: Joshua Tian Jin Tee, Kang Zhang, Hee Suk Yoon, Dhananjaya Nagaraja Gowda, Chanwoo Kim, Chang D. Yoo

Abstract: Diffusion models have recently emerged as a potent tool in generative modeling. However, their inherent iterative nature often results in sluggish image generation due to the requirement for multiple model evaluations. Recent progress has unveiled the intrinsic link between diffusion models and Probability Flow Ordinary Differential Equations (ODEs), thus enabling us to conceptualize diffusion models as ODE systems. Simultaneously, Physics Informed Neural Networks (PINNs) have substantiated their effectiveness in solving intricate differential equations through implicit modeling of their solutions. Building upon these foundational insights, we introduce Physics Informed Distillation (PID), which employs a student model to represent the solution of the ODE system corresponding to the teacher diffusion model, akin to the principles employed in PINNs. Through experiments on CIFAR 10 and ImageNet 64x64, we observe that PID achieves performance comparable to recent distillation methods. Notably, it demonstrates predictable trends concerning method-specific hyperparameters and eliminates the need for synthetic dataset generation during the distillation process, both of which contribute to its easy-to-use nature as a distillation approach for Diffusion Models. Our code and pre-trained checkpoint are publicly available at: https://github.com/pantheon5100/pid_diffusion.git.

Submitted 13 November, 2024; originally announced November 2024.

arXiv:2411.07542 [pdf, other]

Radio Follow-up Observations of SN 2023ixf by Japanese and Korean VLBIs

Authors: Yuhei Iwata, Masanori Akimoto, Tomoki Matsuoka, Keiichi Maeda, Yoshinori Yonekura, Nozomu Tominaga, Takashi J. Moriya, Kenta Fujisawa, Kotaro Niinuma, Sung-Chul Yoon, Jae-Joon Lee, Taehyun Jung, Do-Young Byun

Abstract: We report on radio follow-up observations of the nearby Type II supernova, SN 2023ixf, spanning from 1.7 to 269.9 days after the explosion, conducted using three very long baseline interferometers (VLBIs), which are the Japanese VLBI Network (JVN), the VLBI Exploration of Radio Astrometry (VERA), and the Korean VLBI Network (KVN). In three observation epochs (152.3, 206.1, and 269.9 days), we detected emission at the 6.9 and 8.4 GHz bands, with a flux density of $\sim 5$ mJy. The flux density reached a peak at around 206.1 days, which is longer than the timescale to reach the peak observed in typical Type II supernovae. Based on the analytical model of radio emission, our late-time detections were inferred to be due to the decreasing optical depth. In this case, the mass-loss rate of the progenitor is estimated to have increased from $\sim 10^{-6} - 10^{-5}\, M_{\odot}\,{\rm yr^{-1}}$ to $\sim 10^{-4}\, M_{\odot}\,{\rm yr^{-1}}$ between 28 and 6 years before the explosion. Our radio constraints are also consistent with the mass-loss rate to produce a confined circumstellar medium proposed by previous studies, which suggest that the mass-loss rate increased from $\sim 10^{-4}\, M_{\odot}\,{\rm yr^{-1}}$ to $\gtrsim 10^{-2}\, M_{\odot}\,{\rm yr^{-1}}$ in the last few years before the explosion.

Submitted 11 November, 2024; originally announced November 2024.

Comments: 12 pages, 3 figures, 3 tables. Accepted for publication in ApJ

arXiv:2411.06045 [pdf, other]

doi 10.3847/1538-3881/ad87ec

Moving Groups in the Solar Neighborhood with Gaia, APOGEE, GALAH, and LAMOST: Dynamical Effects Gather Gas and the Ensuing Star Formation Plays an Important Role in Shaping the Stellar Velocity Distributions

Authors: Xilong Liang, Suk-Jin Yoon, Jingkun Zhao

Abstract: With Gaia, APOGEE, GALAH, and LAMOST data, we investigate the positional, kinematic, chemical, and age properties of nine moving groups in the solar neighborhood. We find that each moving group has a distinct distribution in the velocity space in terms of its metallicity, $α$ abundance, and age. Comparison of the moving groups with their underlying background stars suggests that they have experienced the enhanced, prolonged star formation. We infer that any dynamical effects that gathered stars as a moving group in the velocity space also worked for gas. We propose for the first time that the ensuing newborn stars from such gas inherited the kinematic feature from the gas, shaping the current stellar velocity distributions of the groups. Our findings improve the understanding of the origins and evolutionary histories of moving groups in the solar neighborhood.

Submitted 8 November, 2024; originally announced November 2024.

Comments: 22 pages, 9 figures

Journal ref: AJ 168 277 (2024)

arXiv:2411.05793 [pdf, other]

A Comprehensive Survey of Deep Learning for Time Series Forecasting: Architectural Diversity and Open Challenges

Authors: Jongseon Kim, Hyungjoon Kim, HyunGi Kim, Dongjun Lee, Sungroh Yoon

Abstract: Time series forecasting is a critical task that provides key information for decision-making. After traditional statistical and machine learning approaches, various fundamental deep learning architectures such as MLPs, CNNs, RNNs, and GNNs have been developed. However, the structural limitations caused by the inductive biases of each deep learning architecture constrained their performance. Transformer models, which excel at handling long-term dependencies, have become significant architectural components for time series forecasting. However, recent research has shown that alternatives such as simple linear layers can outperform Transformers. These findings have opened up new possibilities for using diverse architectures, ranging from fundamental deep learning models to emerging architectures and hybrid approaches. In this context, architectural modeling of time series forecasting has now entered a renaissance. This survey not only provides a historical context for time series forecasting but also offers comprehensive and timely analysis of the movement toward architectural diversification. By comparing and re-examining deep learning models, we uncover new perspectives and present recent trends, including hybrid, diffusion, Mamba, and foundation models. By focusing on the inherent characteristics of time series data, we also address open challenges that have gained attention in time series forecasting, such as channel dependency, distribution shift, causality, and feature extraction.
These contributions help lower entry barriers for newcomers by providing a systematic understanding of the diverse research areas in time series forecasting (TSF), while offering seasoned researchers broader perspectives and new opportunities through in-depth exploration of TSF challenges. (Shortened due to arXiv's 1,920-character limit. Full version in the paper.)

Submitted 1 May, 2025; v1 submitted 24 October, 2024; originally announced November 2024.

Comments: This is the accepted manuscript of the article published in Artificial Intelligence Review. The final authenticated version is available at: https://doi.org/10.1007/s10462-025-11223-9
