-
Prediction of Frozen Region Growth in Kidney Cryoablation Intervention Using a 3D Flow-Matching Model
Authors:
Siyeop Yoon,
Yujin Oh,
Matthew Tivnan,
Sifan Song,
Pengfei Jin,
Sekeun Kim,
Hyun Jin Cho,
Dufan Wu,
Raul Uppot,
Quanzheng Li
Abstract:
This study presents a 3D flow-matching model designed to predict the progression of the frozen region (iceball) during kidney cryoablation. Precise intraoperative guidance is critical in cryoablation to ensure complete tumor eradication while preserving adjacent healthy tissue. However, conventional methods, typically based on physics-driven or diffusion-based simulations, are computationally demanding and often struggle to represent complex anatomical structures accurately. To address these limitations, our approach leverages intraoperative CT imaging to inform the model. The proposed 3D flow-matching model is trained to learn a continuous deformation field that maps early-stage CT scans to future predictions. This transformation not only estimates the volumetric expansion of the iceball but also generates corresponding segmentation masks, effectively capturing spatial and morphological changes over time. Quantitative analysis highlights the model's robustness, demonstrating strong agreement between predictions and ground-truth segmentations. The model achieves an Intersection over Union (IoU) score of 0.61 and a Dice coefficient of 0.75. By integrating real-time CT imaging with advanced deep learning techniques, this approach has the potential to enhance intraoperative guidance in kidney cryoablation, improving procedural outcomes and advancing the field of minimally invasive surgery.
Submitted 11 March, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
Game-Theoretic Regularized Self-Play Alignment of Large Language Models
Authors:
Xiaohang Tang,
Sangwoong Yoon,
Seongho Son,
Huizhuo Yuan,
Quanquan Gu,
Ilija Bogunovic
Abstract:
Self-play alignment algorithms have been developed as effective methods for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, the regularization with respect to the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. In this paper, we show that our regularization method can improve the unregularized self-play significantly. To study the impact of different regularizations in self-play alignment, we propose Regularized Self-Play Policy Optimization (RSPO). This generalized framework regularizes the self-play by simply adding a chosen regularization term into the loss while maintaining provable last-iterate convergence to the Nash Equilibrium of the corresponding regularized game. Surprisingly, empirical evaluations using the Mistral-7B-Instruct base model reveal that forward KL divergence regularization reduces response length in RSPO, whereas reverse KL divergence markedly improves raw win rates. RSPO with a linear combination of forward and reverse KL divergence regularization substantially increases the length-controlled win rate in AlpacaEval-2, elevating the unregularized self-play alignment method (SPPO) from $28.53\%$ to $35.44\%$. Finally, we show that RSPO also improves the response diversity.
Submitted 24 February, 2025;
originally announced March 2025.
-
EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models
Authors:
Che Hyun Lee,
Heeseung Kim,
Jiheum Yeom,
Sungroh Yoon
Abstract:
We propose EdiText, a controllable text editing method that modifies the reference text to desired attributes at various scales. We integrate an SDEdit-based editing technique that allows for broad adjustments in the degree of text editing. Additionally, we introduce a novel fine-level editing method based on self-conditioning, which allows subtle control of reference text. While being capable of editing on its own, this fine-grained method, integrated with the SDEdit approach, enables EdiText to make precise adjustments within the desired range. EdiText demonstrates its controllability to robustly adjust reference text at a broad range of levels across various tasks, including toxicity control and sentiment control.
Submitted 2 June, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models
Authors:
Heeseung Kim,
Che Hyun Lee,
Sangkwon Park,
Jiheum Yeom,
Nohil Park,
Sangwon Yu,
Sungroh Yoon
Abstract:
Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored. To fill this gap, we systematically evaluate how well open-source interaction models utilize past utterances using ContextDialog, a benchmark we proposed for this purpose. Our findings show that speech-based models have more difficulty than text-based ones, especially when recalling information conveyed in speech, and even with retrieval-augmented generation, models still struggle with questions about past utterances. These insights highlight key limitations in open-source models and suggest ways to improve memory retention and retrieval robustness.
Submitted 23 May, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge
Authors:
Nakyeong Yang,
Minsung Kim,
Seunghyun Yoon,
Joongbo Shin,
Kyomin Jung
Abstract:
Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed, retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely-used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.
Submitted 26 February, 2025;
originally announced February 2025.
-
Letters from Future Self: Augmenting the Letter-Exchange Exercise with LLM-based Agents to Enhance Young Adults' Career Exploration
Authors:
Hayeon Jeon,
Suhwoo Yoon,
Keyeun Lee,
Seo Hyeong Kim,
Esther Hehsun Kim,
Seonghye Cho,
Yena Ko,
Soeun Yang,
Laura Dabbish,
John Zimmerman,
Eun-mee Kim,
Hajin Lim
Abstract:
Young adults often encounter challenges in career exploration. Self-guided interventions, such as the letter-exchange exercise, where participants envision and adopt the perspective of their future selves by exchanging letters with their envisioned future selves, can support career development. However, the broader adoption of such interventions may be limited without structured guidance. To address this, we integrated Large Language Model (LLM)-based agents that simulate participants' future selves into the letter-exchange exercise and evaluated their effectiveness. A one-week experiment (N=36) compared three conditions: (1) participants manually writing replies to themselves from the perspective of their future selves (baseline), (2) future-self agents generating letters to participants, and (3) future-self agents engaging in chat conversations with participants. Results indicated that exchanging letters with future-self agents enhanced participants' engagement during the exercise, while overall benefits of the intervention on future orientation, career self-concept, and psychological support remained comparable across conditions. We discuss design implications for AI-augmented interventions for supporting young adults' career exploration.
Submitted 5 May, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
Characterization of a TES-based Anti-Coincidence Detector for Future Large Field-of-View X-ray Calorimetry Missions
Authors:
Samuel V. Hull,
Joseph S. Adams,
Simon R. Bandler,
Matthew Cherry,
James A. Chervenak,
Renata Cumbee,
Xavier Defay,
Enectali Figueroa-Feliciano,
Fred M. Finkbeiner,
Joshua Fuhrman,
Richard L. Kelley,
Christopher Kenney,
Caroline A. Kilbourne,
Noah Kurinsky,
Jennette Mateo,
Haruka Muramatsu,
Frederick S. Porter,
Kazuhiro Sakai,
Aviv Simchony,
Stephen J. Smith,
Zoe Smith,
Nicholas A. Wakeham,
Edward J. Wassell,
Sang H. Yoon,
Betty A. Young
Abstract:
Microcalorimeter instruments aboard future X-ray observatories will require an anti-coincidence (anti-co) detector to veto charged particle events and reduce the non-X-ray background. We have developed a large-format, TES-based prototype anti-coincidence detector that is particularly suitable for use with spatially-extended (~ 10 cm^2) TES microcalorimeter arrays, as would be used for future large field-of-view X-ray missions. This prototype was developed in the context of the Line Emission Mapper (LEM) probe concept, which required a ~ 14 cm^2 anti-co detector with > 95% live time and a low-energy threshold below 20 keV. Our anti-co design employs parallel networks of quasiparticle-trap-assisted electrothermal feedback TESs (QETs) to detect the athermal phonon signal produced in the detector substrate by incident charged particles. We developed multiple prototype anti-co designs featuring 12 channels and up to 6300 QETs. Here we focus on a design utilizing tungsten TESs and present characterization results. Broad energy range measurements have been performed (4.1 keV -- 5.5 MeV). Based on noise and responsivity measurements, the implied low-energy threshold is < 1 keV and a live time fraction of > 96% can be achieved up to 5.5 MeV. We also find evidence of mm-scale-or-better spatial resolution and discuss the potential utility of this for future missions. Finally, we discuss the early development of a solid-state physics model of the anti-co towards understanding phonon propagation and quasiparticle production in the detector.
Submitted 19 February, 2025;
originally announced February 2025.
-
Value Gradient Sampler: Sampling as Sequential Decision Making
Authors:
Sangwoong Yoon,
Himchan Hwang,
Hyeokju Jeong,
Dong Kyu Shin,
Che-Sang Park,
Sehee Kweon,
Frank Chongwoo Park
Abstract:
We propose the Value Gradient Sampler (VGS), a trainable sampler based on the interpretation of sampling as discrete-time sequential decision-making. VGS generates samples from a given unnormalized density (i.e., energy) by drifting and diffusing randomly initialized particles. In VGS, finding the optimal drift is equivalent to solving an optimal control problem where the cost is the upper bound of the KL divergence between the target density and the samples. We employ value-based dynamic programming to solve this optimal control problem, which gives the gradient of the value function as the optimal drift vector. The connection to sequential decision making allows VGS to leverage extensively studied techniques in reinforcement learning, making VGS a fast, adaptive, and accurate sampler that achieves competitive results in various sampling benchmarks. Furthermore, VGS can replace MCMC in contrastive divergence training of energy-based models. We demonstrate the effectiveness of VGS in training accurate energy-based models in industrial anomaly detection applications.
Submitted 1 March, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
From Selection to Generation: A Survey of LLM-based Active Learning
Authors:
Yu Xia,
Subhojyoti Mukherjee,
Zhouhang Xie,
Junda Wu,
Xintong Li,
Ryan Aponte,
Hanjia Lyu,
Joe Barrow,
Hongjie Chen,
Franck Dernoncourt,
Branislav Kveton,
Tong Yu,
Ruiyi Zhang,
Jiuxiang Gu,
Nesreen K. Ahmed,
Yu Wang,
Xiang Chen,
Hanieh Deilamsalehy,
Sungchul Kim,
Zhengmian Hu,
Yue Zhao,
Nedim Lipka,
Seunghyun Yoon,
Ting-Hao Kenneth Huang,
Zichao Wang, et al. (9 additional authors not shown)
Abstract:
Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.
Submitted 31 May, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Learning to Explain Air Traffic Situation
Authors:
Hong-ah Chai,
Seokbin Yoon,
Keumjin Lee
Abstract:
Understanding how air traffic controllers construct a mental 'picture' of complex air traffic situations is crucial but remains a challenge due to the inherently intricate, high-dimensional interactions between aircraft, pilots, and controllers. Previous work on modeling the strategies of air traffic controllers and their mental image of traffic situations often centers on specific air traffic control tasks or pairwise interactions between aircraft, neglecting to capture the comprehensive dynamics of an air traffic situation. To address this issue, we propose a machine learning-based framework for explaining air traffic situations. Specifically, we employ a Transformer-based multi-agent trajectory model that encapsulates both the spatio-temporal movement of aircraft and social interaction between them. By deriving attention scores from the model, we can quantify the influence of individual aircraft on overall traffic dynamics. This provides explainable insights into how air traffic controllers perceive and understand the traffic situation. Trained on real-world air traffic surveillance data collected from the terminal airspace around Incheon International Airport in South Korea, our framework effectively explicates air traffic situations. This could potentially support and enhance the decision-making and situational awareness of air traffic controllers.
Submitted 27 May, 2025; v1 submitted 15 February, 2025;
originally announced February 2025.
-
Noise Controlled CT Super-Resolution with Conditional Diffusion Model
Authors:
Yuang Wang,
Siyeop Yoon,
Rui Hu,
Baihui Yu,
Duhgoon Lee,
Rajiv Gupta,
Li Zhang,
Zhiqiang Chen,
Dufan Wu
Abstract:
Improving the spatial resolution of CT images is a meaningful yet challenging task, often accompanied by the issue of noise amplification. This article introduces an innovative framework for noise-controlled CT super-resolution utilizing the conditional diffusion model. The model is trained on hybrid datasets, combining noise-matched simulation data with segmented details from real data. Experimental results with real CT images validate the effectiveness of our proposed framework, showing its potential for practical applications in CT imaging.
Submitted 13 February, 2025;
originally announced February 2025.
-
RoToR: Towards More Reliable Responses for Order-Invariant Inputs
Authors:
Soyoung Yoon,
Dongha Ahn,
Youngwon Lee,
Minkyu Jung,
HyungJoo Jang,
Seung-won Hwang
Abstract:
Mitigating positional bias of language models (LMs) for listwise inputs is a well-known and important problem (e.g., lost-in-the-middle). While zero-shot order-invariant LMs have been proposed to solve this issue, their success on practical listwise problems has been limited. In this work, as a first contribution, we identify and overcome two limitations to make zero-shot invariant LMs more practical: (1) training and inference distribution mismatch arising from modifying positional ID assignments to enforce invariance, and (2) failure to adapt to a mixture of order-invariant and order-sensitive inputs in practical listwise problems. Then, to overcome these issues, we propose (1) RoToR, a zero-shot invariant LM for genuinely order-invariant inputs with minimal modifications of positional IDs, and (2) Selective Routing, an adaptive framework that handles both order-invariant and order-sensitive inputs in listwise tasks. On the Lost in the middle (LitM), Knowledge Graph QA (KGQA), and MMLU benchmarks, we show that RoToR with Selective Routing can effectively handle practical listwise input tasks in a zero-shot manner (https://github.com/soyoung97/RoToR).
Submitted 2 June, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
Solving the Content Gap in Roblox Game Recommendations: LLM-Based Profile Generation and Reranking
Authors:
Chen Wang,
Xiaokai Wei,
Yexi Jiang,
Frank Ong,
Kevin Gao,
Xiao Yu,
Zheng Hui,
Se-eun Yoon,
Philip Yu,
Michelle Gong
Abstract:
With the vast and dynamic user-generated content on Roblox, creating effective game recommendations requires a deep understanding of game content. Traditional recommendation models struggle with the inconsistent and sparse nature of game text features such as titles and descriptions. Recent advancements in large language models (LLMs) offer opportunities to enhance recommendation systems by analyzing in-game text data. This paper addresses two challenges: generating high-quality, structured text features for games without extensive human annotation, and validating these features to ensure they improve recommendation relevance. We propose an approach that extracts in-game text and uses LLMs to infer attributes such as genre and gameplay objectives from raw player interactions. Additionally, we introduce an LLM-based re-ranking mechanism to assess the effectiveness of the generated text features, enhancing personalization and user satisfaction. Beyond recommendations, our approach supports applications such as user engagement-based integrity detection, already deployed in production. This scalable framework demonstrates the potential of in-game text understanding to improve recommendation quality on Roblox and adapt recommendations to its unique, user-generated ecosystem.
Submitted 1 February, 2025;
originally announced February 2025.
-
NoLiMa: Long-Context Evaluation Beyond Literal Matching
Authors:
Ali Modarressi,
Hanieh Deilamsalehy,
Franck Dernoncourt,
Trung Bui,
Ryan A. Rossi,
Seunghyun Yoon,
Hinrich Schütze
Abstract:
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
Submitted 26 March, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
Differentiable Mobile Display Photometric Stereo
Authors:
Gawoon Ban,
Hyeongjun Kim,
Seokjun Choi,
Seungwoo Yoon,
Seung-Hwan Baek
Abstract:
Display photometric stereo uses a display as a programmable light source to illuminate a scene with diverse illumination conditions. Recently, differentiable display photometric stereo (DDPS) demonstrated improved normal reconstruction accuracy by using learned display patterns. However, DDPS faced limitations in practicality, requiring a fixed desktop imaging setup using a polarization camera and a desktop-scale monitor. In this paper, we propose a more practical physics-based photometric stereo, differentiable mobile display photometric stereo (DMDPS), that leverages a mobile phone consisting of a display and a camera. We overcome the limitations of using a mobile device by developing a mobile app and method that simultaneously displays patterns and captures high-quality HDR images. Using this technique, we capture real-world 3D-printed objects and learn display patterns via a differentiable learning process. We demonstrate the effectiveness of DMDPS on both a 3D printed dataset and a first dataset of fallen leaves. The leaf dataset contains reconstructed surface normals and albedos of fallen leaves that may enable future research beyond computer graphics and vision. We believe that DMDPS takes a step forward for practical physics-based photometric stereo.
Submitted 7 February, 2025;
originally announced February 2025.
-
Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models
Authors:
Mingi Jung,
Saehuyng Lee,
Eunji Kim,
Sungroh Yoon
Abstract:
Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
Submitted 3 February, 2025;
originally announced February 2025.
-
Knowledge Synthesis of Photosynthesis Research Using a Large Language Model
Authors:
Seungri Yoon,
Woosang Jeon,
Sanghyeok Choi,
Taehyeong Kim,
Tae In Ahn
Abstract:
The development of biological data analysis tools and large language models (LLMs) has opened up new possibilities for utilizing AI in plant science research, with the potential to contribute significantly to knowledge integration and research gap identification. Nonetheless, current LLMs struggle to handle complex biological data and theoretical models in photosynthesis research and often fail to provide accurate scientific contexts. Therefore, this study proposed a photosynthesis research assistant (PRAG) based on OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt optimization. Vector databases and an automated feedback loop were used in the prompt optimization process to enhance the accuracy and relevance of the responses to photosynthesis-related queries. PRAG showed an average improvement of 8.7% across five metrics related to scientific writing, with a 25.4% increase in source transparency. Additionally, its scientific depth and domain coverage were comparable to those of photosynthesis research papers. A knowledge graph was used to structure PRAG's responses with papers within and outside the database, which allowed PRAG to match key entities with 63% and 39.5% of the database and test papers, respectively. PRAG can be applied for photosynthesis research and broader plant science domains, paving the way for more in-depth data analysis and predictive capabilities.
Submitted 3 February, 2025;
originally announced February 2025.
-
EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis
Authors:
Junuk Cha,
Seongro Yoon,
Valeriya Strizhkova,
Francois Bremond,
Seungryul Baek
Abstract:
3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images with real-time inference speed. However, since it is typically trained on only a short video that lacks diversity in facial emotions, the resultant talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. It is able to manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal) while retaining synchronization of lip movements with input audio. Additionally, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than state-of-the-art methods in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C).
Submitted 1 February, 2025;
originally announced February 2025.
-
Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective
Authors:
Yujin Oh,
Pengfei Jin,
Sangjoon Park,
Sekeun Kim,
Siyeop Yoon,
Kyungsang Kim,
Jin Sung Kim,
Xiang Li,
Quanzheng Li
Abstract:
Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code is available at https://github.com/tvseg/dMoE.
Submitted 27 May, 2025; v1 submitted 1 February, 2025;
originally announced February 2025.
-
The SPHEREx Target List of Ice Sources (SPLICES)
Authors:
Matthew L. N. Ashby,
Joseph L. Hora,
Kiran Lakshmipathaiah,
Sarita Vig,
Rama Krishna Sai Subrahmanyam Gorthi,
Miju Kang,
Volker Tolls,
Gary J. Melnick,
Michael W. Werner,
Brendan P. Crill,
Daniel C. Masters,
Carlos Contreras Pena,
Jeong-Eun Lee,
Jaeyeong Kim,
Ho-Gyu Lee,
Sung-Yong Yoon,
Soung-Chul Yang,
Nicholas Flagey,
Bertrand Mennesson
Abstract:
One of the primary objectives of the SPHEREx mission is to understand the origin of molecules such as H2O, CO2, and other volatile compounds at the early stages of planetary system formation. Because the vast majority of these compounds -- typically exceeding 95% -- exist in the solid phase rather than the gaseous phase in the systems of concern here, the observing strategy planned to characterize them is slightly unusual. Specifically, SPHEREx will target highly obscured sources throughout the Milky Way, and observe the species of concern in absorption against background illumination. SPHEREx spectrophotometry will yield ice column density measurements for millions of obscured Milky Way sources of all ages and types. By correlating those column densities with source ages, the SPHEREx mission will shed light on whether those molecules were formed in situ along with their nascent stellar systems, or whether instead they formed elsewhere and were introduced into those systems after their formation. To that end, this work describes version 7.1 of the SPHEREx Target List of Ice Sources (SPLICES) for the community. It contains about 8.6 million objects brighter than W2~12 Vega mag over much of the sky, principally within a broad strip running the length of the Milky Way midplane, but also within high-latitude molecular clouds and even the Magellanic Clouds.
Submitted 29 January, 2025;
originally announced January 2025.
-
CNN-based TEM image denoising from first principles
Authors:
Jinwoong Chae,
Sungwook Hong,
Sungkyu Kim,
Sungroh Yoon,
Gunn Kim
Abstract:
Transmission electron microscope (TEM) images are often corrupted by noise, hindering their interpretation. To address this issue, we propose a deep learning-based approach using simulated images. Using density functional theory calculations with a set of pseudo-atomic orbital basis sets, we generate highly accurate ground truth images. We introduce four types of noise into these simulations to create realistic training datasets. Each type of noise is then used to train a separate convolutional neural network (CNN) model. Our results show that these CNNs are effective in reducing noise, even when applied to images with different noise levels than those used during training. However, we observe limitations in some cases, particularly in preserving the integrity of circular shapes and avoiding visible artifacts between image patches. To overcome these challenges, we propose alternative training strategies and future research directions. This study provides a valuable framework for training deep learning models for TEM image denoising.
Submitted 19 January, 2025;
originally announced January 2025.
-
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Authors:
Junsung Park,
Jungbeom Lee,
Jongyoon Song,
Sangwon Yu,
Dahuin Jung,
Sungroh Yoon
Abstract:
While CLIP has significantly advanced multimodal understanding by bridging vision and language, its inability to grasp negation, such as failing to differentiate concepts like "parking" from "no parking", poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit that this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg, a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.
Submitted 31 March, 2025; v1 submitted 18 January, 2025;
originally announced January 2025.
-
Battling the Non-stationarity in Time Series Forecasting via Test-time Adaptation
Authors:
HyunGi Kim,
Siwon Kim,
Jisoo Mok,
Sungroh Yoon
Abstract:
Deep Neural Networks have spearheaded remarkable advancements in time series forecasting (TSF), one of the major tasks in time series modeling. Nonetheless, the non-stationarity of time series undermines the reliability of pre-trained source time series forecasters in mission-critical deployment settings. In this study, we introduce a pioneering test-time adaptation framework tailored for TSF (TSF-TTA). TAFAS, the proposed approach to TSF-TTA, flexibly adapts source forecasters to continuously shifting test distributions while preserving the core semantic information learned during pre-training. The novel utilization of partially-observed ground truth and gated calibration module enables proactive, robust, and model-agnostic adaptation of source forecasters. Experiments on diverse benchmark datasets and cutting-edge architectures demonstrate the efficacy and generality of TAFAS, especially in long-term forecasting scenarios that suffer from significant distribution shifts. The code is available at https://github.com/kimanki/TAFAS.
Submitted 8 January, 2025;
originally announced January 2025.
-
JOG3R: Towards 3D-Consistent Video Generators
Authors:
Chun-Hao Paul Huang,
Niloy Mitra,
Hyeonho Jeong,
Jae Shin Yoon,
Duygu Ceylan
Abstract:
Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator - OpenSora in our case - can support camera pose estimation. Surprisingly, at first, we only find a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. Instead, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality while producing 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.
Submitted 26 March, 2025; v1 submitted 2 January, 2025;
originally announced January 2025.
-
Measurement of reactor antineutrino oscillation amplitude and frequency using 3800 days of complete data sample of the RENO experiment
Authors:
S. Jeon,
H. I. Kim,
J. H. Choi,
H. I. Jang,
J. S. Jang,
K. K. Joo,
D. E. Jung,
J. G. Kim,
J. H. Kim,
J. Y. Kim,
S. B. Kim,
S. Y. Kim,
W. Kim,
E. Kwon,
D. H. Lee,
H. G. Lee,
W. J. Lee,
I. T. Lim,
D. H. Moon,
M. Y. Pac,
J. S. Park,
R. G. Park,
H. Seo,
J. W. Seo,
C. D. Shin
, et al. (5 additional authors not shown)
Abstract:
We report an updated neutrino mixing angle $θ_{13}$ obtained from the complete data sample of the RENO experiment. The experiment has measured the amplitude and frequency of reactor electron antineutrino ($\bar{ν}_{e}$) oscillations at the Hanbit nuclear power plant, Younggwang, Korea, since August 2011. Data acquisition was completed in March 2023 after a total of 3800 live days of detector operation. The observed candidates via inverse beta decay (IBD) are 1,211,995 (144,667) in the near (far) detector. Based on an observed energy-dependent reactor neutrino disappearance, the neutrino oscillation parameters $θ_{13}$ and $\lvertΔm_{ee}^2\rvert$ are precisely determined as $\sin^{2}2θ_{13}=0.0920_{-0.0042}^{+0.0044}(\text{stat.})_{-0.0041}^{+0.0041}(\text{syst.})$ and $\lvertΔm_{ee}^2\rvert=\left[2.57_{-0.11}^{+0.10}(\text{stat.})_{-0.05}^{+0.05}(\text{syst.})\right]\times10^{-3}~\text{eV}^{2}$. Compared to the previous RENO results published in Phys. Rev. Lett. 121, 201801, the precision is improved from 7.5% to 6.4% for $\sin^{2}2θ_{13}$ and from 5.2% to 4.5% for $\lvertΔm_{ee}^2\rvert$. The statistical error of the measurement has reached our goal and would hardly improve with additional data-taking.
Submitted 24 December, 2024;
originally announced December 2024.
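The relative precisions quoted in the RENO abstract above can be roughly reproduced from the stated central values and uncertainties. The sketch below is illustrative only: it assumes that asymmetric errors are symmetrized by averaging and that statistical and systematic components are combined in quadrature, which may differ from the collaboration's actual procedure. The function name `relative_precision` is hypothetical.

```python
import math


def relative_precision(value, stat_lo, stat_hi, syst_lo, syst_hi):
    """Return the total relative uncertainty in percent.

    Assumption (not taken from the abstract): asymmetric errors are
    symmetrized by averaging, then stat. and syst. are added in quadrature.
    """
    stat = 0.5 * (stat_lo + stat_hi)
    syst = 0.5 * (syst_lo + syst_hi)
    total = math.hypot(stat, syst)  # sqrt(stat**2 + syst**2)
    return 100.0 * total / value


# Central values and uncertainties as quoted in the abstract.
sin2_2theta13 = relative_precision(0.0920, 0.0042, 0.0044, 0.0041, 0.0041)
dm2_ee = relative_precision(2.57, 0.11, 0.10, 0.05, 0.05)  # units of 1e-3 eV^2

print(f"sin^2(2*theta_13): {sin2_2theta13:.2f}%")
print(f"|Delta m^2_ee|:    {dm2_ee:.2f}%")
```

Under these assumptions the results land close to the quoted 6.4% and 4.5%; any small residual difference presumably reflects how the asymmetric errors are actually handled.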
-
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
Authors:
Saehyung Lee,
Seunghyun Yoon,
Trung Bui,
Jing Shi,
Sungroh Yoon
Abstract:
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics and that existing approaches to improve the MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.
Submitted 29 May, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
-
Improving Geometry in Sparse-View 3DGS via Reprojection-based DoF Separation
Authors:
Yongsung Kim,
Minjun Park,
Jooyoung Choi,
Sungroh Yoon
Abstract:
Recent learning-based Multi-View Stereo models have demonstrated state-of-the-art performance in sparse-view 3D reconstruction. However, directly applying 3D Gaussian Splatting (3DGS) as a refinement step following these models presents challenges. We hypothesize that the excessive positional degrees of freedom (DoFs) in Gaussians induce geometry distortion, fitting color patterns at the cost of structural fidelity. To address this, we propose reprojection-based DoF separation, a method distinguishing positional DoFs in terms of uncertainty: image-plane-parallel DoFs and ray-aligned DoF. To independently manage each DoF, we introduce a reprojection process along with tailored constraints for each DoF. Through experiments across various datasets, we confirm that separating the positional DoFs of Gaussians and applying targeted constraints effectively suppresses geometric artifacts, producing reconstruction results that are both visually and geometrically plausible.
Submitted 19 December, 2024;
originally announced December 2024.
-
Denoising Nearest Neighbor Graph via Continuous CRF for Visual Re-ranking without Fine-tuning
Authors:
Jaeyoon Kim,
Yoonki Cho,
Taeyong Kim,
Sung-Eui Yoon
Abstract:
Visual re-ranking using a Nearest Neighbor graph (NN graph) has been adopted to yield high retrieval accuracy, since it is beneficial for exploring a high-dimensional manifold and is applicable without additional fine-tuning. The quality of visual re-ranking using an NN graph, however, is limited by the quality of its connectivity, i.e., the edges of the NN graph. Some edges can be misconnected with negative images. This is known as the noisy edge problem, and it degrades retrieval quality. To address this, we propose a complementary denoising method based on Continuous Conditional Random Field (C-CRF) that uses a statistical distance of our similarity-based distribution. This method employs the concept of cliques to make the process computationally feasible. We demonstrate the complementarity of our method through its application to three visual re-ranking methods, observing quality boosts in landmark retrieval and person re-identification (re-ID).
Submitted 18 December, 2024;
originally announced December 2024.
-
Text2Relight: Creative Portrait Relighting with Text Guidance
Authors:
Junuk Cha,
Mengwei Ren,
Krishna Kumar Singh,
He Zhang,
Yannick Hold-Geoffroy,
Seunghyun Yoon,
HyunJoon Jung,
Jae Shin Yoon,
Seungryul Baek
Abstract:
We present a lighting-aware image editing pipeline that, given a portrait image and a text prompt, performs single-image relighting. Our model modifies the lighting and color of both the foreground and background to align with the provided text description. The unbounded creativity of text allows us to describe the lighting of a scene with any sensory features, including temperature, emotion, smell, time, and so on. However, modeling such a mapping between unbounded text and lighting is extremely challenging because no scalable dataset provides large numbers of paired text and relighting examples, and therefore current text-driven image editing models do not generalize to lighting-specific use cases. We overcome this problem by introducing a novel data synthesis pipeline. First, diverse and creative text prompts that describe scenes with various lighting are automatically generated under a crafted hierarchy using a large language model (e.g., ChatGPT). A text-guided image generation model then creates a lighting image that best matches the text. Conditioned on the lighting images, we perform image-based relighting for both foreground and background using a single portrait image or a set of OLAT (One-Light-at-A-Time) images captured from a light stage system. For background relighting in particular, we represent the lighting image as a set of point lights and transfer them to other background images. A generative diffusion model learns from the synthesized large-scale data with auxiliary task augmentation (e.g., portrait delighting and light positioning) to correlate the latent text and lighting distributions for text-guided portrait relighting.
Submitted 18 December, 2024;
originally announced December 2024.
-
Transmit What You Need: Task-Adaptive Semantic Communications for Visual Information
Authors:
Jeonghun Park,
Sung Whan Yoon
Abstract:
Recently, semantic communications have drawn great attention as a groundbreaking concept that surpasses the limited capacity of Shannon's theory. Specifically, semantic communications are likely to become crucial in realizing visual tasks that demand massive network traffic. Although highly distinctive forms of visual semantics exist for computer vision tasks, a thorough investigation of which visual semantics can be transmitted in time and which are required for completing different visual tasks has not yet been reported. To this end, we first scrutinize the achievable throughput in transmitting existing visual semantics through limited wireless communication bandwidth. In addition, we demonstrate the resulting performance of various visual tasks for each visual semantic. Based on the empirical testing, we suggest that a task-adaptive selection of visual semantics is crucial for real-time semantic communications for visual tasks, where we transmit basic semantics (e.g., objects in the given image) for simple visual tasks, such as classification, and richer semantics (e.g., scene graphs) for complex tasks, such as image regeneration. To further improve transmission efficiency, we suggest a filtering method for scene graphs, which drops redundant information in the scene graph, thus allowing only the essential semantics for completing the given task to be sent. We confirm the efficacy of our task-adaptive semantic communication approach through extensive simulations in wireless channels, showing more than 45 times larger throughput than a naive transmission of the original data. Our work can be reproduced using the source code at https://github.com/jhpark2024/jhpark.github.io
Submitted 18 December, 2024;
originally announced December 2024.
-
GUI Agents: A Survey
Authors:
Dang Nguyen,
Jian Chen,
Yu Wang,
Gang Wu,
Namyong Park,
Zhengmian Hu,
Hanjia Lyu,
Junda Wu,
Ryan Aponte,
Yu Xia,
Xintong Li,
Jing Shi,
Hongjie Chen,
Viet Dac Lai,
Zhouhang Xie,
Sungchul Kim,
Ruiyi Zhang,
Tong Yu,
Mehrab Tanjim,
Nesreen K. Ahmed,
Puneet Mathur,
Seunghyun Yoon,
Lina Yao,
Branislav Kveton,
Thien Huu Nguyen
, et al. (4 additional authors not shown)
Abstract:
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
Submitted 17 December, 2024;
originally announced December 2024.
-
Generating Diverse Hypotheses for Inductive Reasoning
Authors:
Kang-il Lee,
Hyukhun Koh,
Dongryeol Lee,
Seunghyun Yoon,
Minsung Kim,
Kyomin Jung
Abstract:
Inductive reasoning - the process of inferring general rules from a small number of observations - is a fundamental aspect of human intelligence. Recent works suggest that large language models (LLMs) can engage in inductive reasoning by sampling multiple hypotheses about the rules and selecting the one that best explains the observations. However, due to IID sampling, semantically redundant hypotheses are frequently generated, leading to significant wastage of compute. In this paper, we 1) demonstrate that increasing the temperature to enhance diversity is limited by the text degeneration issue, and 2) propose a novel method to improve diversity while maintaining text quality. We first analyze the effect of increasing the temperature parameter, which is regarded as the LLM's diversity control, on IID hypotheses. Our analysis shows that as temperature rises, the diversity and accuracy of hypotheses increase up to a certain point, but this trend saturates due to text degeneration. To generate hypotheses that are more semantically diverse and of higher quality, we propose a novel approach inspired by human inductive reasoning, which we call Mixture of Concepts (MoC). When applied to several inductive reasoning benchmarks, MoC demonstrated significant performance improvements compared to standard IID sampling and other approaches.
Submitted 8 February, 2025; v1 submitted 17 December, 2024;
originally announced December 2024.
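The role of temperature as a diversity control, discussed in the abstract above, can be illustrated with a minimal sketch of temperature-scaled softmax sampling. This is a generic illustration, not the authors' Mixture of Concepts method; the logits are hypothetical values chosen for the example, and the helper names are not from the paper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits, for illustration only.
logits = [2.0, 1.0, 0.5, 0.1]

entropies = []
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    entropies.append(entropy(probs))
    print(f"T={temperature}: entropy={entropies[-1]:.3f}")
```

Higher temperatures flatten the distribution and raise its entropy, which increases sample diversity but also gives low-probability tokens more weight. That is consistent with the degeneration trade-off the abstract describes.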
-
Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation
Authors:
SeungBum Ha,
Taehwan Lee,
Jiyoun Lim,
Sung Whan Yoon
Abstract:
Federated learning (FL) has recently garnered attention as a data-decentralized training framework that enables the learning of deep models from locally distributed samples while keeping data privacy. Built upon the framework, immense efforts have been made to establish FL benchmarks, which provide rigorous evaluation settings that control data heterogeneity across clients. Prior efforts have mainly focused on handling relatively simple classification tasks, where each sample is annotated with a one-hot label, such as MNIST, CIFAR, LEAF benchmark, etc. However, little attention has been paid to demonstrating an FL benchmark that handles complicated semantics, where each sample encompasses diverse semantic information from multiple labels, such as Panoptic Scene Graph Generation (PSG) with objects, subjects, and relations between them. Because the existing benchmark is designed to distribute data in a narrow view of a single semantic, e.g., a one-hot label, managing the complicated semantic heterogeneity across clients when formalizing FL benchmarks is non-trivial. In this paper, we propose a benchmark process to establish an FL benchmark with controllable semantic heterogeneity across clients: two key steps are i) data clustering with semantics and ii) data distributing via controllable semantic heterogeneity across clients. As a proof of concept, we first construct a federated PSG benchmark, demonstrating the efficacy of the existing PSG methods in an FL setting with controllable semantic heterogeneity of scene graphs. We also present the effectiveness of our benchmark by applying robust federated learning algorithms to data heterogeneity to show increased performance. Our code is available at https://github.com/Seung-B/FL-PSG.
Submitted 11 December, 2024;
originally announced December 2024.
-
FaceShield: Defending Facial Image against Deepfake Threats
Authors:
Jaehwan Jeong,
Sumin In,
Sieun Kim,
Hannie Shin,
Jongheon Jeong,
Sang Ho Yoon,
Jaewook Chung,
Sangpil Kim
Abstract:
The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity is disregarded. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named FaceShield, which introduces novel defense strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates defenses on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG compression. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting transferability to GANs and showcasing greater imperceptibility of noise along with enhanced robustness.
Submitted 10 March, 2025; v1 submitted 13 December, 2024;
originally announced December 2024.
-
Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens
Authors:
Jaihyun Lew,
Soohyuk Jang,
Jaehoon Lee,
Seungryong Yoo,
Eunji Kim,
Saehyung Lee,
Jisoo Mok,
Siwon Kim,
Sungroh Yoon
Abstract:
Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization pipeline, comprised of pre-aggregate extraction and superpixel-aware aggregation, overcomes the challenges that arise in superpixel tokenization. Extensive experiments demonstrate that our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.
Submitted 24 March, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
Domain-specific Question Answering with Hybrid Search
Authors:
Dewang Sultania,
Zhaoyu Lu,
Twisha Naik,
Franck Dernoncourt,
David Seunghyun Yoon,
Sanat Sharma,
Trung Bui,
Ashok Gupta,
Tushar Vatsa,
Suhas Suresha,
Ishita Verma,
Vibha Belavadi,
Cheng Chen,
Michael Friedrich
Abstract:
Domain-specific question answering is an evolving field that requires specialized solutions to address unique challenges. In this paper, we show that a hybrid approach combining a fine-tuned dense retriever with keyword-based sparse search methods significantly enhances performance. Our system leverages a linear combination of relevance signals, including cosine similarity from dense retrieval, BM25 scores, and URL host matching, each with tunable boost parameters. Experimental results indicate that this hybrid method outperforms our single-retriever system, achieving improved accuracy while maintaining robust contextual grounding. These findings suggest that integrating multiple retrieval methodologies with weighted scoring effectively addresses the complexities of domain-specific question answering in enterprise settings.
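The linear combination of relevance signals described in this abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Candidate`, `hybrid_score`, and `rank`, the default weights, and the max-normalization of BM25 scores are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url_host: str        # host part of the document URL
    dense_cosine: float  # cosine similarity from the dense retriever
    bm25: float          # raw BM25 score from the sparse retriever


def hybrid_score(c, preferred_hosts, w_dense=1.0, w_bm25=0.5,
                 host_boost=0.2, bm25_max=1.0):
    """Weighted sum of three relevance signals (weights are hypothetical)."""
    # Assumption: BM25 is divided by the largest BM25 score in the candidate
    # set so that it is on a scale comparable to cosine similarity.
    bm25_norm = c.bm25 / bm25_max if bm25_max > 0 else 0.0
    host_match = 1.0 if c.url_host in preferred_hosts else 0.0
    return w_dense * c.dense_cosine + w_bm25 * bm25_norm + host_boost * host_match


def rank(candidates, preferred_hosts, **weights):
    """Sort candidates by hybrid score, highest first."""
    bm25_max = max((c.bm25 for c in candidates), default=0.0)
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c, preferred_hosts, bm25_max=bm25_max, **weights),
        reverse=True,
    )
```

Setting `w_bm25=0.0` and `host_boost=0.0` reduces the ranking to the dense retriever alone, which gives a simple way to compare the hybrid ordering against a single-retriever baseline.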
Submitted 21 December, 2024; v1 submitted 4 December, 2024;
originally announced December 2024.
-
Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios
Authors:
Sangyeon Yoon,
Wonje Jeung,
Albert No
Abstract:
Auditing Differentially Private Stochastic Gradient Descent (DP-SGD) in the final model setting is challenging and often results in empirical lower bounds that are significantly looser than theoretical privacy guarantees. We introduce a novel auditing method that achieves tighter empirical lower bounds without additional assumptions by crafting worst-case adversarial samples through loss-based input-space auditing. Our approach surpasses traditional canary-based heuristics and is effective in final model-only scenarios. Specifically, with a theoretical privacy budget of $\varepsilon = 10.0$, our method achieves empirical lower bounds of $4.914$, compared to the baseline of $4.385$ for MNIST. Our work offers a practical framework for reliable and accurate privacy auditing in differentially private machine learning.
Submitted 24 February, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.
-
Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes
Authors:
Suhyun Shin,
Seungwoo Yoon,
Ryota Maeda,
Seung-Hwan Baek
Abstract:
Hyperspectral 3D imaging captures both depth maps and hyperspectral images, enabling comprehensive geometric and material analysis. Recent methods achieve high spectral and depth accuracy; however, they require long acquisition times, often over several minutes, or rely on large, expensive systems, restricting their use to static scenes. We present Dense Dispersed Structured Light (DDSL), an accurate hyperspectral 3D imaging method for dynamic scenes that utilizes stereo RGB cameras and an RGB projector equipped with an affordable diffraction grating film. We design spectrally multiplexed DDSL patterns that significantly reduce the number of required projector patterns, thereby accelerating acquisition speed. Additionally, we formulate an image formation model and a reconstruction method to estimate a hyperspectral image and depth map from captured stereo images. We experimentally demonstrate that DDSL, the first practical and accurate hyperspectral 3D imaging method for dynamic scenes, achieves a spectral resolution of 15.5 nm full width at half maximum (FWHM), a depth error of 4 mm, and a frame rate of 6.6 fps.
Submitted 2 December, 2024;
originally announced December 2024.
-
OMuleT: Orchestrating Multiple Tools for Practicable Conversational Recommendation
Authors:
Se-eun Yoon,
Xiaokai Wei,
Yexi Jiang,
Rachit Pareek,
Frank Ong,
Kevin Gao,
Julian McAuley,
Michelle Gong
Abstract:
In this paper, we present a systematic effort to design, evaluate, and implement a realistic conversational recommender system (CRS). The objective of our system is to allow users to input free-form text to request recommendations, and then receive a list of relevant and diverse items. While previous work on synthetic queries augments large language models (LLMs) with 1-3 tools, we argue that a more extensive toolbox is necessary to effectively handle real user requests. As such, we propose a novel approach that equips LLMs with over 10 tools, providing them access to the internal knowledge base and API calls used in production. We evaluate our model on a dataset of real users and show that it generates relevant, novel, and diverse recommendations compared to vanilla LLMs. Furthermore, we conduct ablation studies to demonstrate the effectiveness of using the full range of tools in our toolbox. We share our designs and lessons learned from deploying the system for internal alpha release. Our contribution is the addressing of all four key aspects of a practicable CRS: (1) real user requests, (2) augmenting LLMs with a wide variety of tools, (3) extensive evaluation, and (4) deployment insights.
Submitted 31 December, 2024; v1 submitted 28 November, 2024;
originally announced November 2024.
-
A New Rarity Assessment of the `Disk of Satellites': the Milky Way System Is the Exception Rather than the Rule in the $Λ$CDM Cosmology
Authors:
Chanoul Seo,
Suk-Jin Yoon,
Sanjaya Paudel,
Sung-Ho An,
Jun-Sung Moon
Abstract:
The majority of satellite galaxies around the Milky Way (MW) show disk-like distributions (the disk of satellites; DoS), which is a small-scale problem of the $Λ$CDM cosmology. The conventional definition of the MW-like DoS is a satellite system with a minor-to-major axis ratio ($c$/$a$) lower than the MW's $c$/$a$ value of 0.181. Here we question the validity of the $c$/$a$-based DoS rarity assessment and propose an alternative approach. How satellites are placed around a galaxy is dictated mainly by two factors: the distributions of satellites' orbital poles and distances from the host. Based on this premise, we construct the `satellite distribution generator' code and generate 10$^5$ `spatially and kinematically analogous systems (SKASs)' sharing these two factors. The SKAS can disclose the intrinsic, underlying $c$/$a$ probability distribution function (PDF), from which a present-day $c$/$a$ value is fortuitously determined. We find that the $c$/$a$ PDF of the MW DoS defined by 11 classical satellites is quite broad ($σ_{c/a}$$\sim$0.105), implying that a simple present-day $c$/$a$ value, combined with its highly time-variable nature, cannot fully represent the degree of flatness. Moreover, based on the intrinsic $c$/$a$ PDF, we re-evaluate the rarity of the MW DoS by comparing it with IllustrisTNG50-1 host-satellite systems and find that even with the new measure, the MW DoS remains rare (0.00$\sim$3.40%). We show that the reason behind the rareness is that both orbital poles and distances of the 11 MW satellites are far more plane-friendly than those of simulated host-satellite systems, challenging the current structure and galaxy formation model.
Submitted 26 November, 2024;
originally announced November 2024.
-
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Authors:
Chaehun Shin,
Jooyoung Choi,
Heeseung Kim,
Sungroh Yoon
Abstract:
Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/
Submitted 23 November, 2024;
originally announced November 2024.
-
Style-Friendly SNR Sampler for Style-Driven Generation
Authors:
Jooyoung Choi,
Chaehun Shin,
Yeongtak Oh,
Heeseung Kim,
Jungbeom Lee,
Sungroh Yoon
Abstract:
Recent text-to-image diffusion models generate high-quality images but struggle to learn new, personalized styles, which limits the creation of unique style templates. In style-driven generation, users typically supply reference images exemplifying the desired style, together with text prompts that specify desired stylistic attributes. Previous approaches commonly rely on fine-tuning, yet they often blindly reuse objectives and noise-level distributions from pre-training without adaptation. We discover that stylistic features predominantly emerge at higher noise levels, leading current fine-tuning methods to exhibit suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enhances models' ability to capture novel styles indicated by reference images and text prompts. We demonstrate improved generation of novel styles that cannot be adequately described solely with a text prompt, enabling the creation of new style templates for personalized content creation.
Submitted 20 March, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization
Authors:
Sanghyeob Song,
Jaihyun Lew,
Hyemi Jang,
Sungroh Yoon
Abstract:
Estimating the homography between two images is crucial for mid- or high-level vision tasks, such as image stitching and fusion. However, using supervised learning methods is often challenging or costly due to the difficulty of collecting ground-truth data. In response, unsupervised learning approaches have emerged. Most early methods, though, assume that the given image pairs are from the same camera or have minor lighting differences. Consequently, while these methods perform effectively under such conditions, they generally fail when input image pairs come from different domains, referred to as multimodal image pairs. To address these limitations, we propose AltO, an unsupervised learning framework for estimating homography in multimodal image pairs. Our method employs a two-phase alternating optimization framework, similar to Expectation-Maximization (EM), where one phase reduces the geometry gap and the other addresses the modality gap. To handle these gaps, we use Barlow Twins loss for the modality gap and propose an extended version, Geometry Barlow Twins, for the geometry gap. As a result, we demonstrate that our method, AltO, can be trained on multimodal datasets without any ground-truth data. It not only outperforms other unsupervised methods but is also compatible with various architectures of homography estimators. The source code can be found at: https://github.com/songsang7/AltO
Submitted 19 November, 2024;
originally announced November 2024.
-
Generalizable Person Re-identification via Balancing Alignment and Uniformity
Authors:
Yoonki Cho,
Jaeyoon Kim,
Woo Jae Kim,
Junsik Jung,
Sung-eui Yoon
Abstract:
Domain generalizable person re-identification (DG re-ID) aims to learn discriminative representations that are robust to distributional shifts. While data augmentation is a straightforward solution to improve generalization, certain augmentations exhibit a polarized effect in this task, enhancing in-distribution performance while deteriorating out-of-distribution performance. In this paper, we investigate this phenomenon and reveal that it leads to sparse representation spaces with reduced uniformity. To address this issue, we propose a novel framework, Balancing Alignment and Uniformity (BAU), which effectively mitigates this effect by maintaining a balance between alignment and uniformity. Specifically, BAU incorporates alignment and uniformity losses applied to both original and augmented images and integrates a weighting strategy to assess the reliability of augmented samples, further improving the alignment loss. Additionally, we introduce a domain-specific uniformity loss that promotes uniformity within each source domain, thereby enhancing the learning of domain-invariant features. Extensive experimental results demonstrate that BAU effectively exploits the advantages of data augmentation, which previous studies could not fully utilize, and achieves state-of-the-art performance without requiring complex training procedures. The code is available at https://github.com/yoonkicho/BAU.
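For readers unfamiliar with the two quantities being balanced, the sketch below shows the commonly used generic definitions of alignment and uniformity on unit-normalized embeddings. It is an assumption that these textbook forms resemble what BAU builds on; the weighting strategy and the domain-specific uniformity loss from the abstract are not reproduced, and the function names and the temperature `t` are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment_loss(originals, augmented):
    """Mean squared distance between each embedding and its augmented view."""
    return sum(sq_dist(u, v) for u, v in zip(originals, augmented)) / len(originals)


def uniformity_loss(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean the embeddings are spread more evenly over the sphere.
    """
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_potential = sum(
        math.exp(-t * sq_dist(embeddings[i], embeddings[j])) for i, j in pairs
    ) / len(pairs)
    return math.log(mean_potential)
```

A lower alignment loss means augmented views stay close to their originals, while a lower uniformity loss means the embeddings are less clustered, so the two objectives pull in different directions when augmentations are aggressive.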
Submitted 18 November, 2024;
originally announced November 2024.
-
Discovery of a Rare Group of Dwarf Galaxies in the Local Universe
Authors:
Sanjaya Paudel,
Cristiano G. Sabiu,
Suk-Jin Yoon,
Pierre-Alain Duc,
Jaewon Yoo,
Oliver Müller
Abstract:
We report the discovery of a rare isolated group of five dwarf galaxies located at z = 0.0086 ($D$ = 36 Mpc). All member galaxies are star-forming, blue, and gas-rich with $g-r$ indices ranging from 0.2 to 0.6 mag, and two of them show signs of ongoing mutual interaction. The most massive member of the group has a stellar mass that is half of the Small Magellanic Cloud stellar mass, and the median stellar mass of the group members is 7.87 $\times$ 10$^{7}$ M$_{\odot}$. The derived total dynamical mass of the group is $M_{\rm dyn}$ = 6.02$\times$10$^{10}$ M$_{\odot}$, whereas its total baryonic mass (stellar + HI) is 2.6$\times$10$^{9}$ M$_{\odot}$, which gives us the dynamical to baryonic mass ratio of 23. Interestingly, all galaxies found in the group are aligned along a straight line in the plane of the sky. The observed spatial extent of the member galaxies is 154 kpc, and their relative line-of-sight velocity span is within 75 km s$^{-1}$. Using the spatially resolved optical spectra provided by DESI EDR, we find that three group members share a common rotational direction. With these unique properties of the group and its member galaxies, we discuss the possible importance of such a system in the formation and evolution of dwarf galaxy groups and in testing the theory of large-scale structure formation.
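As a quick arithmetic check of the quoted ratio, dividing the dynamical mass by the baryonic mass stated in the abstract gives the following (the symbol $M_{\rm bar}$ is introduced here only as shorthand for the stellar + HI mass):

$$\frac{M_{\rm dyn}}{M_{\rm bar}} = \frac{6.02\times10^{10}\,M_{\odot}}{2.6\times10^{9}\,M_{\odot}} \approx 23.2,$$

which rounds to the reported value of 23.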
Submitted 15 November, 2024;
originally announced November 2024.
-
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
Authors:
Thang M. Pham,
Phat T. Nguyen,
Seunghyun Yoon,
Viet Dac Lai,
Franck Dernoncourt,
Trung Bui
Abstract:
While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering, and suggestion tasks. Our smallest model demonstrates efficient performance on the S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.
Submitted 25 November, 2024; v1 submitted 14 November, 2024;
originally announced November 2024.
-
Physics Informed Distillation for Diffusion Models
Authors:
Joshua Tian Jin Tee,
Kang Zhang,
Hee Suk Yoon,
Dhananjaya Nagaraja Gowda,
Chanwoo Kim,
Chang D. Yoo
Abstract:
Diffusion models have recently emerged as a potent tool in generative modeling. However, their inherent iterative nature often results in sluggish image generation due to the requirement for multiple model evaluations. Recent progress has unveiled the intrinsic link between diffusion models and Probability Flow Ordinary Differential Equations (ODEs), thus enabling us to conceptualize diffusion models as ODE systems. Simultaneously, Physics Informed Neural Networks (PINNs) have substantiated their effectiveness in solving intricate differential equations through implicit modeling of their solutions. Building upon these foundational insights, we introduce Physics Informed Distillation (PID), which employs a student model to represent the solution of the ODE system corresponding to the teacher diffusion model, akin to the principles employed in PINNs. Through experiments on CIFAR-10 and ImageNet 64x64, we observe that PID achieves performance comparable to recent distillation methods. Notably, it demonstrates predictable trends concerning method-specific hyperparameters and eliminates the need for synthetic dataset generation during the distillation process, both of which contribute to its ease of use as a distillation approach for diffusion models. Our code and pre-trained checkpoint are publicly available at: https://github.com/pantheon5100/pid_diffusion.git.
Submitted 13 November, 2024;
originally announced November 2024.
-
Radio Follow-up Observations of SN 2023ixf by Japanese and Korean VLBIs
Authors:
Yuhei Iwata,
Masanori Akimoto,
Tomoki Matsuoka,
Keiichi Maeda,
Yoshinori Yonekura,
Nozomu Tominaga,
Takashi J. Moriya,
Kenta Fujisawa,
Kotaro Niinuma,
Sung-Chul Yoon,
Jae-Joon Lee,
Taehyun Jung,
Do-Young Byun
Abstract:
We report on radio follow-up observations of the nearby Type II supernova, SN 2023ixf, spanning from 1.7 to 269.9 days after the explosion, conducted using three very long baseline interferometers (VLBIs), which are the Japanese VLBI Network (JVN), the VLBI Exploration of Radio Astrometry (VERA), and the Korean VLBI Network (KVN). In three observation epochs (152.3, 206.1, and 269.9 days), we detected emission at the 6.9 and 8.4 GHz bands, with a flux density of $\sim 5$ mJy. The flux density reached a peak at around 206.1 days, which is longer than the timescale to reach the peak observed in typical Type II supernovae. Based on the analytical model of radio emission, our late-time detections were inferred to be due to the decreasing optical depth. In this case, the mass-loss rate of the progenitor is estimated to have increased from $\sim 10^{-6} - 10^{-5}\, M_{\odot}\,{\rm yr^{-1}}$ to $\sim 10^{-4}\, M_{\odot}\,{\rm yr^{-1}}$ between 28 and 6 years before the explosion. Our radio constraints are also consistent with the mass-loss rate to produce a confined circumstellar medium proposed by previous studies, which suggest that the mass-loss rate increased from $\sim 10^{-4}\, M_{\odot}\,{\rm yr^{-1}}$ to $\gtrsim 10^{-2}\, M_{\odot}\,{\rm yr^{-1}}$ in the last few years before the explosion.
Submitted 11 November, 2024;
originally announced November 2024.
-
Moving Groups in the Solar Neighborhood with Gaia, APOGEE, GALAH, and LAMOST: Dynamical Effects Gather Gas and the Ensuing Star Formation Plays an Important Role in Shaping the Stellar Velocity Distributions
Authors:
Xilong Liang,
Suk-Jin Yoon,
Jingkun Zhao
Abstract:
With Gaia, APOGEE, GALAH, and LAMOST data, we investigate the positional, kinematic, chemical, and age properties of nine moving groups in the solar neighborhood. We find that each moving group has a distinct distribution in velocity space in terms of its metallicity, $α$ abundance, and age. Comparison of the moving groups with their underlying background stars suggests that they have experienced enhanced, prolonged star formation. We infer that any dynamical effects that gathered stars into a moving group in velocity space also acted on gas. We propose for the first time that the newborn stars formed from such gas inherited its kinematic features, shaping the current stellar velocity distributions of the groups. Our findings improve the understanding of the origins and evolutionary histories of moving groups in the solar neighborhood.
Submitted 8 November, 2024;
originally announced November 2024.
-
A Comprehensive Survey of Deep Learning for Time Series Forecasting: Architectural Diversity and Open Challenges
Authors:
Jongseon Kim,
Hyungjoon Kim,
HyunGi Kim,
Dongjun Lee,
Sungroh Yoon
Abstract:
Time series forecasting is a critical task that provides key information for decision-making. After traditional statistical and machine learning approaches, various fundamental deep learning architectures such as MLPs, CNNs, RNNs, and GNNs have been developed. However, the structural limitations caused by the inductive biases of each deep learning architecture constrained their performance. Transformer models, which excel at handling long-term dependencies, have become significant architectural components for time series forecasting. However, recent research has shown that alternatives such as simple linear layers can outperform Transformers. These findings have opened up new possibilities for using diverse architectures, ranging from fundamental deep learning models to emerging architectures and hybrid approaches. In this context, architectural modeling of time series forecasting has now entered a renaissance. This survey not only provides a historical context for time series forecasting but also offers comprehensive and timely analysis of the movement toward architectural diversification. By comparing and re-examining deep learning models, we uncover new perspectives and present recent trends, including hybrid, diffusion, Mamba, and foundation models. By focusing on the inherent characteristics of time series data, we also address open challenges that have gained attention in time series forecasting, such as channel dependency, distribution shift, causality, and feature extraction. These contributions help lower entry barriers for newcomers by providing a systematic understanding of the diverse research areas in time series forecasting (TSF), while offering seasoned researchers broader perspectives and new opportunities through in-depth exploration of TSF challenges. (Shortened due to arXiv's 1,920-character limit. Full version in the paper.)
Submitted 1 May, 2025; v1 submitted 24 October, 2024;
originally announced November 2024.