-
Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models
Authors:
Sungwon Hwang,
Hyojin Jang,
Kinam Kim,
Minho Park,
Jaegul choo
Abstract:
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimila…
▽ More
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, its internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: https://crepavideo.github.io
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
Authors:
Sangwon Jang,
Taekyung Ki,
Jaehyeong Jo,
Jaehong Yoon,
Soo Ye Kim,
Zhe Lin,
Sung Ju Hwang
Abstract:
Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation b…
▽ More
Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, compatible with any video models. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.
△ Less
Submitted 8 June, 2025;
originally announced June 2025.
-
Understanding Community-Level Blocklists in Decentralized Social Media
Authors:
Owen Xingjian Zhang,
Sohyeon Hwang,
Yuhan Liu,
Manoel Horta Ribeiro,
Andrés Monroy-Hernández
Abstract:
Community-level blocklists are key to content moderation practices in decentralized social media. These blocklists enable moderators to prevent other communities, such as those acting in bad faith, from interacting with their own -- and, if shared publicly, warn others about communities worth blocking. Prior work has examined blocklists in centralized social media, noting their potential for colle…
▽ More
Community-level blocklists are key to content moderation practices in decentralized social media. These blocklists enable moderators to prevent other communities, such as those acting in bad faith, from interacting with their own -- and, if shared publicly, warn others about communities worth blocking. Prior work has examined blocklists in centralized social media, noting their potential for collective moderation outcomes, but has focused on blocklists as individual-level tools. To understand how moderators perceive and utilize community-level blocklists and what additional support they may need, we examine social media communities running Mastodon, an open-source microblogging software built on the ActivityPub protocol. We conducted (1) content analysis of the community-level blocklist ecosystem, and (2) semi-structured interviews with twelve Mastodon moderators. Our content analysis revealed wide variation in blocklist goals, inclusion criteria, and transparency. Interviews showed moderators balance proactive safety, reactive practices, and caution around false positives when using blocklists for moderation. They noted challenges and limitations in current blocklist use, suggesting design improvements like comment receipts, category filters, and collaborative voting. We discuss implications for decentralized content moderation, highlighting trade-offs between openness, safety, and nuance; the complexity of moderator roles; and opportunities for future design.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
ECoRAG: Evidentiality-guided Compression for Long Context RAG
Authors:
Yeonseok Jeong,
Jinsu Kim,
Dohyeon Lee,
Seung-won Hwang
Abstract:
Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce RAG overhead, from longer context, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limit the performance in LLM-based RAG.…
▽ More
Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce RAG overhead, from longer context, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limit the performance in LLM-based RAG. We thus propose Evidentiality-guided RAG, or ECoRAG framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring whether answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects whether the compressed content provides sufficient evidence, and if not, retrieves more until sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.
△ Less
Submitted 6 June, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model
Authors:
Youngwan Lee,
Kangsan Kim,
Kwanyong Park,
Ilcahe Jung,
Soojin Jang,
Seanie Lee,
Yong-Ju Lee,
Sung Ju Hwang
Abstract:
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in…
▽ More
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
△ Less
Submitted 11 June, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
Backbone Augmented Training for Adaptations
Authors:
Jae Wan Park,
Junhyeok Kim,
Youngjun Jun,
Hyunah Ko,
Seong Jae Hwang
Abstract:
Adaptations facilitate efficient training of large backbone models, including diffusion models for image generation and transformer-based language models. While various adaptation techniques enhance performance with minimal computational resources, limited adaptation data often leads to challenges in training. To address this, we focus on the enormous amount of backbone data used to pre-train the…
▽ More
Adaptations facilitate efficient training of large backbone models, including diffusion models for image generation and transformer-based language models. While various adaptation techniques enhance performance with minimal computational resources, limited adaptation data often leads to challenges in training. To address this, we focus on the enormous amount of backbone data used to pre-train the backbone models. We propose Backbone Augmented Training (BAT), a method that leverages backbone data to augment the adaptation dataset. First, we formulate and prove two mathematical key propositions: one establishes the validity of BAT, while the other identifies a condition under which BAT benefits adaptation. Furthermore, we introduce an advanced data selection scheme that satisfies these propositions and present ALBAT algorithm to implement this approach. ALBAT efficiently enhances adaptation training in both personalization and language generation tasks with scarce data.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Searching for Dark Galaxies with HI detection from the Arecibo Legacy Fast ALFA (ALFALFA) survey
Authors:
Minseong Kwon,
Ho Seong Hwang,
Brian R. Kent,
Ilsang Yoon,
Gain Lee,
Hyein Yoon
Abstract:
We present a catalog of 142 dark galaxy candidates in a region covered by the Arecibo Legacy Fast ALFA (ALFALFA) survey. We start with 344 ALFALFA HI sources without optical counterparts and remove those that do not seem to have dark galaxy origin. To do that, we first eliminate 83 sources that are known HI clouds probably formed from tidal interactions between galaxies and 13 sources that have op…
▽ More
We present a catalog of 142 dark galaxy candidates in a region covered by the Arecibo Legacy Fast ALFA (ALFALFA) survey. We start with 344 ALFALFA HI sources without optical counterparts and remove those that do not seem to have dark galaxy origin. To do that, we first eliminate 83 sources that are known HI clouds probably formed from tidal interactions between galaxies and 13 sources that have optical counterparts. We then remove 56 sources located near other HI sources, which are likely to be HI clouds. We further exclude 10 sources that have nearby HI sources within the ALFALFA beam and 40 sources potentially associated with nearby galaxies. We perform visual inspection of optical images from DESI Legacy Imaging Survey with an $r$-band surface brightness limit of $\sim 28.5 \rm \ mag\ arcsec^{-2}$ as well as NUV images from GALEX to confirm the absence of stellar emission. We additionally inspect infrared images from WISE and AKARI for dust emission. As a result, we identify 142 dark galaxy candidates and analyze their physical properties by comparing with luminous galaxies. We find that the dark galaxy candidates generally have smaller dynamical masses, higher HI-to-dynamical mass ratios, and are located in less dense regions when compared to luminous galaxies, which is consistent with results from cosmological simulations. This sample provides an important testbed for studying the role of dark matter in galaxy formation and evolution.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models
Authors:
Seongjae Kang,
Dong Bok Lee,
Hyungjoon Jang,
Dongseop Kim,
Sung Ju Hwang
Abstract:
Knowledge distillation (KD) is a widely used framework for training compact, task-specific models by leveraging the knowledge of teacher models. However, its application to active learning (AL), which aims to minimize annotation costs through iterative sample selection, remains underexplored. This gap stems from the fact that KD typically assumes access to sufficient labeled data, whereas AL opera…
▽ More
Knowledge distillation (KD) is a widely used framework for training compact, task-specific models by leveraging the knowledge of teacher models. However, its application to active learning (AL), which aims to minimize annotation costs through iterative sample selection, remains underexplored. This gap stems from the fact that KD typically assumes access to sufficient labeled data, whereas AL operates in data-scarce scenarios where task-specific teacher models are often unavailable. In this paper, we introduce ActiveKD, a framework that integrates AL with KD by leveraging the zero- and few-shot capabilities of large vision-language models (VLMs). A key aspect of ActiveKD is the structured prediction bias of VLMs -- i.e., their predictions form clusters in the probability space. We regard this structure as an inductive bias of the teacher model, capturing generalizable output patterns beneficial to student learning. To exploit this bias, we propose Probabilistic CoreSet (PCoreSet), a selection strategy that maximizes coverage in the probability space rather than the feature space. PCoreSet strategically selects categorically diverse unlabeled samples, facilitating more efficient transfer of teacher knowledge under limited annotation budgets. Evaluations on 11 datasets show that PCoreSet consistently outperforms existing selection methods within the ActiveKD framework, advancing research at the intersection of AL and KD.
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
Parallel Rescaling: Rebalancing Consistency Guidance for Personalized Diffusion Models
Authors:
JungWoo Chae,
Jiyoon Kim,
Sangheum Hwang
Abstract:
Personalizing diffusion models to specific users or concepts remains challenging, particularly when only a few reference images are available. Existing methods such as DreamBooth and Textual Inversion often overfit to limited data, causing misalignment between generated images and text prompts when attempting to balance identity fidelity with prompt adherence. While Direct Consistency Optimization…
▽ More
Personalizing diffusion models to specific users or concepts remains challenging, particularly when only a few reference images are available. Existing methods such as DreamBooth and Textual Inversion often overfit to limited data, causing misalignment between generated images and text prompts when attempting to balance identity fidelity with prompt adherence. While Direct Consistency Optimization (DCO) with its consistency-guided sampling partially alleviates this issue, it still struggles with complex or stylized prompts. In this paper, we propose a parallel rescaling technique for personalized diffusion models. Our approach explicitly decomposes the consistency guidance signal into parallel and orthogonal components relative to classifier free guidance (CFG). By rescaling the parallel component, we minimize disruptive interference with CFG while preserving the subject's identity. Unlike prior personalization methods, our technique does not require additional training data or expensive annotations. Extensive experiments show improved prompt alignment and visual fidelity compared to baseline methods, even on challenging stylized prompts. These findings highlight the potential of parallel rescaled guidance to yield more stable and accurate personalization for diverse user inputs.
△ Less
Submitted 31 May, 2025;
originally announced June 2025.
-
ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering
Authors:
Ruofan Wu,
Youngwon Lee,
Fan Shu,
Danmei Xu,
Seung-won Hwang,
Zhewei Yao,
Yuxiong He,
Feng Yan
Abstract:
Retrieval-Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi-hop question answering. We introduce ComposeRAG, a novel modular abstracti…
▽ More
Retrieval-Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi-hop question answering. We introduce ComposeRAG, a novel modular abstraction that decomposes RAG pipelines into atomic, composable modules. Each module, such as Question Decomposition, Query Rewriting, Retrieval Decision, and Answer Verification, acts as a parameterized transformation on structured inputs/outputs, allowing independent implementation, upgrade, and analysis. To enhance robustness against errors in multi-step reasoning, ComposeRAG incorporates a self-reflection mechanism that iteratively revisits and refines earlier steps upon verification failure. Evaluated on four challenging multi-hop QA benchmarks, ComposeRAG consistently outperforms strong baselines in both accuracy and grounding fidelity. Specifically, it achieves up to a 15% accuracy improvement over fine-tuning-based methods and up to a 5% gain over reasoning-specialized pipelines under identical retrieval conditions. Crucially, ComposeRAG significantly enhances grounding: its verification-first design reduces ungrounded answers by over 10% in low-quality retrieval settings, and by approximately 3% even with strong corpora. Comprehensive ablation studies validate the modular architecture, demonstrating distinct and additive contributions from each component. These findings underscore ComposeRAG's capacity to deliver flexible, transparent, scalable, and high-performing multi-hop reasoning with improved grounding and interpretability.
△ Less
Submitted 30 May, 2025;
originally announced June 2025.
-
CREFT: Sequential Multi-Agent LLM for Character Relation Extraction
Authors:
Ye Eun Chun,
Taeyoon Hwang,
Seung-won Hwang,
Byung-Hak Kim
Abstract:
Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long-form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowled…
▽ More
Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long-form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowledge distillation, then iteratively refines character composition, relation extraction, role identification, and group assignments. Experiments on a curated Korean drama dataset demonstrate that CREFT significantly outperforms single-agent LLM baselines in both accuracy and completeness. By systematically visualizing character networks, CREFT streamlines narrative comprehension and accelerates script review -- offering substantial benefits to the entertainment, publishing, and educational sectors.
△ Less
Submitted 30 May, 2025;
originally announced May 2025.
-
From Token to Action: State Machine Reasoning to Mitigate Overthinking in Information Retrieval
Authors:
Dohyeon Lee,
Yeonseok Jeong,
Seung-won Hwang
Abstract:
Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning t…
▽ More
Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning that diverges from user intent. To address these, we propose State Machine Reasoning (SMR), a transition-based reasoning framework composed of discrete actions (Refine, Rerank, Stop) that support early stopping and fine-grained control. Experiments on the BEIR and BRIGHT benchmarks show that SMR improves retrieval performance (nDCG@10) by 3.4% while reducing token usage by 74.4%. It generalizes across LLMs and retrievers without requiring task-specific tuning, offering a practical alternative to conventional CoT reasoning. The code and details are available at https://github.com/ldilab/SMR.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
Bayesian Neural Scaling Law Extrapolation with Prior-Fitted Networks
Authors:
Dongwoo Lee,
Dong Bok Lee,
Steven Adriaensen,
Juho Lee,
Sung Ju Hwang,
Frank Hutter,
Seon Joo Kim,
Hae Beom Lee
Abstract:
Scaling has been a major driver of recent advancements in deep learning. Numerous empirical studies have found that scaling laws often follow the power-law and proposed several variants of power-law functions to predict the scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications invol…
▽ More
Scaling has been a major driver of recent advancements in deep learning. Numerous empirical studies have found that scaling laws often follow the power-law and proposed several variants of power-law functions to predict the scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications involving decision-making problems such as determining the expected performance improvements achievable by investing additional computational resources. In this work, we explore a Bayesian framework based on Prior-data Fitted Networks (PFNs) for neural scaling law extrapolation. Specifically, we design a prior distribution that enables the sampling of infinitely many synthetic functions resembling real-world neural scaling laws, allowing our PFN to meta-learn the extrapolation. We validate the effectiveness of our approach on real-world neural scaling laws, comparing it against both the existing point estimation methods and Bayesian approaches. Our method demonstrates superior performance, particularly in data-limited scenarios such as Bayesian active learning, underscoring its potential for reliable, uncertainty-aware extrapolation in practical applications.
△ Less
Submitted 11 June, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
-
Seeing the Politics of Decentralized Social Media Protocols
Authors:
Tolulope Oshinowo,
Sohyeon Hwang,
Amy X. Zhang,
Andrés Monroy-Hernández
Abstract:
Calls to decentralize feed-based social media have been driven by concerns about the concentrated power of centralized platforms and their societal impact. In response, numerous decentralized social media protocols have emerged, each interpreting "decentralization" in different ways. We analyze four such protocols -- ActivityPub, AT Protocol, Nostr, and Farcaster -- to develop a novel conceptual f…
▽ More
Calls to decentralize feed-based social media have been driven by concerns about the concentrated power of centralized platforms and their societal impact. In response, numerous decentralized social media protocols have emerged, each interpreting "decentralization" in different ways. We analyze four such protocols -- ActivityPub, AT Protocol, Nostr, and Farcaster -- to develop a novel conceptual framework for understanding how protocols operationalize decentralization. Drawing from protocol documentation, media coverage, and first-hand interviews with protocol developers and experts, we contextualize each protocol's approach within their respective socio-technical goals. Our framework highlights how control over key components is distributed differently across each protocol, shaping who holds power over what kinds of decisions. How components are arranged in relation to one another further impacts how component owners might offset each other's power in shaping social media. We argue that examining protocols as artifacts reveals how values shape infrastructure and power dynamics -- and that with a holistic framework as a guide, we can more effectively evaluate and design decentralized platforms aligned with the social and political futures we envision.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding
Authors:
Patara Trirat,
Wonyong Jeong,
Sung Ju Hwang
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This…
▽ More
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms state-of-the-art methods in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking
Authors:
Soyoung Yoon,
Gyuwan Kim,
Gyu-Hwung Cho,
Seung-won Hwang
Abstract:
Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, lea…
▽ More
Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels, and our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
The Relational Origins of Rules in Online Communities
Authors:
Charles Kiene,
Sohyeon Hwang,
Nathan TeBlunthuis,
Carl Colglazier,
Aaron Shaw,
Benjamin Mako Hill
Abstract:
Where do rules come from in online communities? While prior studies of online community governance in social computing have sought to characterize rules by their functions within communities and documented practices of rule enforcement, they have largely overlooked rule adoption and change. This study investigates how and why online communities adopt and change their rules. We conducted a grounded…
▽ More
Where do rules come from in online communities? While prior studies of online community governance in social computing have sought to characterize rules by their functions within communities and documented practices of rule enforcement, they have largely overlooked rule adoption and change. This study investigates how and why online communities adopt and change their rules. We conducted a grounded theory-based analysis of 40 in-depth interviews with community leaders from subreddits, Fandom wikis, and Fediverse servers, and identified seven processes involved in the adoption of online community rules. Our findings reveal that, beyond regulating behavior and solving functional intra-community problems, rules are also adopted and changed for relational reasons, such as signaling or reinforcing community legitimacy and identity to other communities. While rule change was often prompted by challenges during community growth or decline, change also depended on volunteer leaders' work capacity, the presence of member feedback mechanisms, and relational dynamics between leaders and members. The findings extend prior theories from social computing and organizational research, illustrating how institutionalist and ecological explanations of the relational origins of rules complement more functional accounts. The results also support design recommendations that integrate the relational aspects of rules and rulemaking to facilitate successful governance across communities' lifecycles.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Distilling LLM Agent into Small Models with Retrieval and Code Tools
Authors:
Minki Kang,
Jongwon Jeong,
Seanie Lee,
Jaewoong Cho,
Sung Ju Hwang
Abstract:
Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise co…
▽ More
Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries
Authors:
Jonghwi Kim,
Deokhyung Kang,
Seonjeong Hwang,
Yunsu Kim,
Jungseul Ok,
Gary Lee
Abstract:
Despite bilingual speakers frequently using mixed-language queries in web searches, Information Retrieval (IR) research on them remains scarce. To address this, we introduce MiLQ,Mixed-Language Query test set, the first public benchmark of mixed-language queries, confirmed as realistic and highly preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently…
▽ More
Despite bilingual speakers frequently using mixed-language queries in web searches, Information Retrieval (IR) research on them remains scarce. To address this, we introduce MiLQ,Mixed-Language Query test set, the first public benchmark of mixed-language queries, confirmed as realistic and highly preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently across native, English, and mixed-language queries, also suggesting code-switched training data's potential for robust IR models handling such queries. Meanwhile, intentional English mixing in queries proves an effective strategy for bilinguals searching English documents, which our analysis attributes to enhanced token matching compared to native queries.
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
MADCluster: Model-agnostic Anomaly Detection with Self-supervised Clustering Network
Authors:
Sangyong Lee,
Subo Hwang,
Dohoon Kim
Abstract:
In this paper, we propose MADCluster, a novel model-agnostic anomaly detection framework utilizing self-supervised clustering. MADCluster is applicable to various deep learning architectures and addresses the 'hypersphere collapse' problem inherent in existing deep learning-based anomaly detection methods. The core idea is to cluster normal pattern data into a 'single cluster' while simultaneously…
▽ More
In this paper, we propose MADCluster, a novel model-agnostic anomaly detection framework utilizing self-supervised clustering. MADCluster is applicable to various deep learning architectures and addresses the 'hypersphere collapse' problem inherent in existing deep learning-based anomaly detection methods. The core idea is to cluster normal pattern data into a 'single cluster' while simultaneously learning the cluster center and mapping data close to this center. Also, to improve expressiveness and enable effective single clustering, we propose a new 'One-directed Adaptive loss'. The optimization of this loss is mathematically proven. MADCluster consists of three main components: Base Embedder capturing high-dimensional temporal dynamics, Cluster Distance Mapping, and Sequence-wise Clustering for continuous center updates. Its model-agnostic characteristics are achieved by applying various architectures to the Base Embedder. Experiments on four time series benchmark datasets demonstrate that applying MADCluster improves the overall performance of comparative models. In conclusion, the compatibility of MADCluster shows potential for enhancing model performance across various architectures.
△ Less
Submitted 11 June, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
Multispectral Detection Transformer with Infrared-Centric Sensor Fusion
Authors:
Seongmin Hwang,
Daeyoung Han,
Moongu Jeon
Abstract:
Multispectral object detection aims to leverage complementary information from visible (RGB) and infrared (IR) modalities to enable robust performance under diverse environmental conditions. In this letter, we propose IC-Fusion, a multispectral object detector that effectively fuses visible and infrared features through a lightweight and modalityaware design. Motivated by wavelet analysis and empi…
▽ More
Multispectral object detection aims to leverage complementary information from visible (RGB) and infrared (IR) modalities to enable robust performance under diverse environmental conditions. In this letter, we propose IC-Fusion, a multispectral object detector that effectively fuses visible and infrared features through a lightweight and modalityaware design. Motivated by wavelet analysis and empirical observations, we find that IR images contain structurally rich high-frequency information critical for object localization, while RGB images provide complementary semantic context. To exploit this, we adopt a compact RGB backbone and design a novel fusion module comprising a Multi-Scale Feature Distillation (MSFD) block to enhance RGB features and a three-stage fusion block with Cross-Modal Channel Shuffle Gate (CCSG) and Cross-Modal Large Kernel Gate (CLKG) to facilitate effective cross-modal interaction. Experiments on the FLIR and LLVIP benchmarks demonstrate the effectiveness and efficiency of our IR-centric fusion strategy. Our code is available at https://github.com/smin-hwang/IC-Fusion.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA
Authors:
Seanie Lee,
Sangwoo Park,
Dong Bok Lee,
Dominik Wagner,
Haebin Seong,
Tobias Bocklet,
Juho Lee,
Sung Ju Hwang
Abstract:
Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in federated learning (FL). However, when combined with differentially private stochastic gradient descent (DP-SGD), LoRA faces substantial noise amplification: DP-SGD perturbs per-sample gradients, and the matrix mul…
▽ More
Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in federated learning (FL). However, when combined with differentially private stochastic gradient descent (DP-SGD), LoRA faces substantial noise amplification: DP-SGD perturbs per-sample gradients, and the matrix multiplication of the LoRA update ($BA$) intensifies this effect. Freezing one matrix (e.g., $A$) reduces the noise but restricts model expressiveness, often resulting in suboptimal adaptation. To address this, we propose FedSVD, a simple yet effective method that introduces a global reparameterization based on singular value decomposition (SVD). In our approach, each client optimizes only the $B$ matrix and transmits it to the server. The server aggregates the $B$ matrices, computes the product $BA$ using the previous $A$, and refactorizes the result via SVD. This yields a new adaptive $A$ composed of the orthonormal right singular vectors of $BA$, and an updated $B$ containing the remaining SVD components. This reparameterization avoids quadratic noise amplification, while allowing $A$ to better capture the principal directions of the aggregate updates. Moreover, the orthonormal structure of $A$ bounds the gradient norms of $B$ and preserves more signal under DP-SGD, as confirmed by our theoretical analysis. As a result, FedSVD consistently improves stability and performance across a variety of privacy settings and benchmarks, outperforming relevant baselines under both private and non-private regimes.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Beyond Individual UX: Defining Group Experience(GX) as a New Paradigm for Group-centered AI
Authors:
Soohwan Lee,
Seoyeong Hwang,
Kyungho Lee
Abstract:
Recent advancements in HCI and AI have predominantly centered on individual user experiences, often neglecting the emergent dynamics of group interactions. This provocation introduces Group Experience(GX) to capture the collective perceptual, emotional, and cognitive dimensions that arise when individuals interact in cohesive groups. We challenge the conventional Human-centered AI paradigm and pro…
▽ More
Recent advancements in HCI and AI have predominantly centered on individual user experiences, often neglecting the emergent dynamics of group interactions. This provocation introduces Group Experience(GX) to capture the collective perceptual, emotional, and cognitive dimensions that arise when individuals interact in cohesive groups. We challenge the conventional Human-centered AI paradigm and propose Group-centered AI(GCAI) as a framework that actively mediates group dynamics, amplifies diverse voices, and fosters ethical collective decision-making. Drawing on social psychology, organizational behavior, and group dynamics, we outline a group-centered design approach that balances individual autonomy with collective interests while developing novel evaluative metrics. Our analysis emphasizes rethinking traditional methodologies that focus solely on individual outcomes and advocates for innovative strategies to capture group collaboration. We call on researchers to bridge the gap between micro-level experiences and macro-level impacts, ultimately enriching and transforming collaborative human interactions.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning
Authors:
Yeonkyung Lee,
Woojung Han,
Youngjun Jun,
Hyeonmin Kim,
Jungkyung Cho,
Seong Jae Hwang
Abstract:
Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely availab…
▽ More
Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely available and serves as a valuable resource for analyzing disease progression. To effectively incorporate patient-specific information, we propose PRETI, a retinal foundation model that integrates metadata-aware learning with robust self-supervised representation learning. We introduce Learnable Metadata Embedding (LME), which dynamically refines metadata representations. Additionally, we construct patient-level data pairs, associating images from the same individual to improve robustness against non-clinical variations. To further optimize retinal image representation, we propose Retina-Aware Adaptive Masking (RAAM), a strategy that selectively applies masking within the retinal region and dynamically adjusts the masking ratio during training. PRETI captures both global structures and fine-grained pathological details, resulting in superior diagnostic performance. Extensive experiments demonstrate that PRETI achieves state-of-the-art results across diverse diseases and biomarker predictions using in-house and public data, indicating the importance of metadata-guided foundation models in retinal disease analysis. Our code and pretrained model are available at https://github.com/MICV-yonsei/PRETI
△ Less
Submitted 18 May, 2025;
originally announced May 2025.
-
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
Authors:
Jeffrey Willette,
Heejun Lee,
Sung Ju Hwang
Abstract:
The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance…
▽ More
The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36%pt performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
△ Less
Submitted 16 May, 2025;
originally announced May 2025.
-
Redshift Evolution of the Intrinsic Alignments of Galaxies and Subhalos in the Horizon Run 5 Simulation
Authors:
Sanghyeon Han,
Motonari Tonegawa,
Ho Seong Hwang,
Yohan Dubois,
Juhan Kim,
Yonghwi Kim,
Oh-Kyoung Kwon,
Jaehyun Lee,
Owain N. Snaith,
Brad K. Gibson,
Changbom Park
Abstract:
We investigate the redshift evolution of intrinsic alignments of the shapes of galaxies and subhalos with the large-scale structures of the universe using the cosmological hydrodynamic simulation, $\textit{Horizon Run 5}$. To this end, early-type galaxies are selected from the simulated galaxy catalogs based on stellar mass and kinematic morphology. The shapes of galaxies and subhalos are computed…
▽ More
We investigate the redshift evolution of intrinsic alignments of the shapes of galaxies and subhalos with the large-scale structures of the universe using the cosmological hydrodynamic simulation, $\textit{Horizon Run 5}$. To this end, early-type galaxies are selected from the simulated galaxy catalogs based on stellar mass and kinematic morphology. The shapes of galaxies and subhalos are computed using the reduced inertia tensor derived from mass-weighted particle positions. We find that the misalignment between galaxies and their corresponding dark-matter subhalos decreases over time. We further analyze the two-point correlation between galaxy or subhalo shapes and the large-scale density field traced by their spatial distribution, and quantify the amplitude using the nonlinear alignment model across a wide redshift range from $z = 0.625$ to $z = 2.5$. We find that the intrinsic alignment amplitude, $A_{\rm NLA}$, of galaxies remains largely constant with redshift, whereas that of dark matter subhalos exhibits moderate redshift evolution, with a power-law slope that deviates from zero at a significance level exceeding $3σ$. Additionally, $A_{\rm NLA}$ is found to depend on both the stellar mass and kinematic morphology of galaxies. Notably, our results are broadly consistent with existing observational constraints. Our findings are in good agreement with previous results of other cosmological simulations.
△ Less
Submitted 16 May, 2025;
originally announced May 2025.
-
System Prompt Optimization with Meta-Learning
Authors:
Yumin Choi,
Jinheon Baek,
Sung Ju Hwang
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the sy…
▽ More
Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
GradMix: Gradient-based Selective Mixup for Robust Data Augmentation in Class-Incremental Learning
Authors:
Minsu Kim,
Seong-Hyeon Hwang,
Steven Euijong Whang
Abstract:
In the context of continual learning, acquiring new knowledge while maintaining previous knowledge presents a significant challenge. Existing methods often use experience replay techniques that store a small portion of previous task data for training. In experience replay approaches, data augmentation has emerged as a promising strategy to further improve the model performance by mixing limited pr…
▽ More
In the context of continual learning, acquiring new knowledge while maintaining previous knowledge presents a significant challenge. Existing methods often use experience replay techniques that store a small portion of previous task data for training. In experience replay approaches, data augmentation has emerged as a promising strategy to further improve the model performance by mixing limited previous task data with sufficient current task data. However, we theoretically and empirically analyze that training with mixed samples from random sample pairs may harm the knowledge of previous tasks and cause greater catastrophic forgetting. We then propose GradMix, a robust data augmentation method specifically designed for mitigating catastrophic forgetting in class-incremental learning. GradMix performs gradient-based selective mixup using a class-based criterion that mixes only samples from helpful class pairs and not from detrimental class pairs for reducing catastrophic forgetting. Our experiments on various real datasets show that GradMix outperforms data augmentation baselines in accuracy by minimizing the forgetting of previous knowledge.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization
Authors:
Seongjae Kang,
Dong Bok Lee,
Hyungjoon Jang,
Sung Ju Hwang
Abstract:
Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-s…
▽ More
Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning, increasing computational overhead and optimization complexity. In this paper, we propose $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization ($\mathbf{\texttt{DHO}}$) -- a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings. Specifically, we introduce dual prediction heads that independently learn from labeled data and teacher predictions, and propose to linearly combine their outputs during inference. We observe that $\texttt{DHO}$ mitigates gradient conflicts between supervised and distillation signals, enabling more effective feature learning than single-head KD baselines. As a result, extensive experiments show that $\texttt{DHO}$ consistently outperforms baselines across multiple domains and fine-grained datasets. Notably, on ImageNet, it achieves state-of-the-art performance, improving accuracy by 3% and 0.1% with 1% and 10% labeled data, respectively, while using fewer parameters.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation
Authors:
Seokjun Kwon,
Jeongmin Shin,
Namil Kim,
Soonmin Hwang,
Yukyung Choi
Abstract:
In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effect…
▽ More
In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, they significantly decrease performance during domain adaptation. In this paper, we present a comprehensive study on cross-spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self-supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre-trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method achieves higher performance over previous UDA methods and comparable performance to state-of-the-art supervised methods.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Directional Thermal Emission Across Both Polarizations in Planar Photonic Architectures
Authors:
David E. Abraham,
Daniel Cui,
Baolai Liang,
Jae S. Hwang,
Parthiban Santhanam,
Linus Kim,
Rayen Lin,
Aaswath P. Raman
Abstract:
Directional and spectral control of thermal emission is essential for applications in energy conversion, imaging, and sensing. Existing planar, lithography-free epsilon-near-zero (ENZ) films only support transverse-magnetic (TM) control of thermal emission via the Berreman mode and cannot address transverse-electric (TE) waves due to the absence of natural optical magnetism over optical and infrar…
▽ More
Directional and spectral control of thermal emission is essential for applications in energy conversion, imaging, and sensing. Existing planar, lithography-free epsilon-near-zero (ENZ) films only support transverse-magnetic (TM) control of thermal emission via the Berreman mode and cannot address transverse-electric (TE) waves due to the absence of natural optical magnetism over optical and infrared wavelengths Here, we introduce a hyperbolic metamaterial comprising alternating layers of degenerately-doped and intrinsic InAs that exhibits an epsilon-and-mu-near-zero (EMNZ) response, enabling dual-polarized, directionally and spectrally selective thermal emission. We first theoretically demonstrate that a mu-near-zero (MNZ) film on a perfect magnetic conductor supports a magnetic Berreman mode, absorbing TE-polarized radiation in analogy to the conventional Berreman mode supported in TM polarization. Using genetic and gradient-descent optimization, we design a dual-polarized emitter with independently tunable spectral peaks and emission angles. Parameter retrieval via homogenization confirms simultaneous EMNZ points at the target wavelengths and angles. Finally, experimental measurement of a sample fabricated via molecular beam epitaxy exhibits high absorptivity peaks for both polarizations in close agreement with simulations. This work realizes lithography-free, dual-polarized, spectrally and directionally selective emitters, offering a versatile platform for advanced infrared thermal management and device integration.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
RVSNUpy: A Python Package for Spectroscopic Redshift Measurement Based on Cross-Correlation
Authors:
Taewan Kim,
Jubee Sohn,
Ho Seong Hwang
Abstract:
We introduce RVSNUpy, a new Python package designed to measure spectroscopic redshifts. Based on inverse-variance weighted cross-correlation, RVSNUpy determines the redshifts by comparing observed spectra with various rest-frame template spectra. We test the performance of RVSNUpy based on ~ 6000 objects in the HectoMAP redshift survey observed with both SDSS and MMT/Hectospec. We demonstrate that…
▽ More
We introduce RVSNUpy, a new Python package designed to measure spectroscopic redshifts. Based on inverse-variance weighted cross-correlation, RVSNUpy determines the redshifts by comparing observed spectra with various rest-frame template spectra. We test the performance of RVSNUpy based on ~ 6000 objects in the HectoMAP redshift survey observed with both SDSS and MMT/Hectospec. We demonstrate that a slight redshift offset (~ 40 km/s) between SDSS and MMT/Hectospec measurements reported from previous studies results from the small offsets in the redshift template spectra used for SDSS and Hectospec reductions. We construct the universal set of template spectra, including empirical SDSS template spectra, carefully calibrated to the rest frame. Our test for the HectoMAP objects with duplicated observations shows that RVSNUpy with the universal template spectra yields the homogeneous redshift from the spectra obtained with different spectrographs. We highlight that RVSNUpy is a powerful redshift measurement tool for current and future large-scale spectroscopy surveys, including A-SPEC, DESI, 4MOST, and Subaru/PFS.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Authors:
Xindi Wu,
Hee Seung Hwang,
Polina Kirichenko,
Olga Russakovsky
Abstract:
Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be partially the result of the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditional…
▽ More
Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be partially the result of the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume, but not the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several, especially those involving complex multi-capability tasks. For example, COMPACT achieves substantial 83.3% improvement on MMStar and 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities
Authors:
Woongyeong Yeo,
Kangsan Kim,
Soyeong Jeong,
Jinheon Baek,
Sung Ju Hwang
Abstract:
Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In…
▽ More
Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over various modality-specific and unified baselines.
△ Less
Submitted 19 May, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
DG-DETR: Toward Domain Generalized Detection Transformer
Authors:
Seongmin Hwang,
Daeyoung Han,
Moongu Jeon
Abstract:
End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-an…
▽ More
End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-and-play method that improves out-of-distribution (OOD) robustness for DETRs. Specifically, we propose a novel domain-agnostic query selection strategy that removes domain-induced biases from object queries via orthogonal projection onto the instance-specific style space. Additionally, we leverage a wavelet decomposition to disentangle features into domain-invariant and domain-specific components, enabling synthesis of diverse latent styles while preserving the semantic features of objects. Experimental results validate the effectiveness of DG-DETR. Our code is available at https://github.com/sminhwang/DG-DETR.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
DOSE : Drum One-Shot Extraction from Music Mixture
Authors:
Suntae Hwang,
Seonghyeon Kang,
Kyungsu Kim,
Semin Ahn,
Kyogu Lee
Abstract:
Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction, a task in which the goal is to extract drum one-shots that are present in the music mixture. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with correspo…
▽ More
Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction, a task in which the goal is to extract drum one-shots that are present in the music mixture. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with corresponding drum one-shot samples. Our proposed model, Drum One- Shot Extractor (DOSE), leverages neural audio codec language models for end-to-end extraction, bypassing traditional source separation steps. Additionally, we introduce a novel onset loss, designed to encourage accurate prediction of the initial transient of drum one-shots, which is essential for capturing timbral characteristics. We compare this approach against a source separation-based extraction method as a baseline. The results, evaluated using Frechet Audio Distance (FAD) and Multi-Scale Spectral loss (MSS), demonstrate that DOSE, enhanced with onset loss, outperforms the baseline, providing more accurate and higher-quality drum one-shots from music mixtures. The code, model checkpoint, and audio examples are available at https://github.com/HSUNEH/DOSE
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
Enhancing Variational Autoencoders with Smooth Robust Latent Encoding
Authors:
Hyomin Lee,
Minseon Kim,
Sangwon Jang,
Jongheon Jeong,
Sung Ju Hwang
Abstract:
Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models, as in Stable Diffusion, yet questions regarding their robustness remain largely underexplored. Although adversarial training has been an established technique for enhancing robustness in predictive models, it has been overlooked for generative models due to concerns about potential fidelity degr…
▽ More
Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models, as in Stable Diffusion, yet questions regarding their robustness remain largely underexplored. Although adversarial training has been an established technique for enhancing robustness in predictive models, it has been overlooked for generative models due to concerns about potential fidelity degradation by the nature of trade-offs between performance and robustness. In this work, we challenge this presumption, introducing Smooth Robust Latent VAE (SRL-VAE), a novel adversarial training framework that boosts both generation quality and robustness. In contrast to conventional adversarial training, which focuses on robustness only, our approach smooths the latent space via adversarial perturbations, promoting more generalizable representations while regularizing with originality representation to sustain original fidelity. Applied as a post-training step on pre-trained VAEs, SRL-VAE improves image robustness and fidelity with minimal computational overhead. Experiments show that SRL-VAE improves both generation quality, in image reconstruction and text-guided image editing, and robustness, against Nightshade attacks and image editing attacks. These results establish a new paradigm, showing that adversarial training, once thought to be detrimental to generative models, can instead enhance both fidelity and robustness.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Authors:
Minju Seo,
Jinheon Baek,
Seongyun Lee,
Sung Ju Hwang
Abstract:
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent…
▽ More
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, particularly from the authors of those papers, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
△ Less
Submitted 18 May, 2025; v1 submitted 23 April, 2025;
originally announced April 2025.
-
Breast density in MRI: an AI-based quantification and relationship to assessment in mammography
Authors:
Yaqian Chen,
Lin Li,
Hanxue Gu,
Haoyu Dong,
Derek L. Nguyen,
Allan D. Kirk,
Maciej A. Mazurowski,
E. Shelley Hwang
Abstract:
Mammographic breast density is a well-established risk factor for breast cancer. Recently there has been interest in breast MRI as an adjunct to mammography, as this modality provides an orthogonal and highly quantitative assessment of breast tissue. However, its 3D nature poses analytic challenges related to delineating and aggregating complex structures across slices. Here, we applied an in-hous…
▽ More
Mammographic breast density is a well-established risk factor for breast cancer. Recently there has been interest in breast MRI as an adjunct to mammography, as this modality provides an orthogonal and highly quantitative assessment of breast tissue. However, its 3D nature poses analytic challenges related to delineating and aggregating complex structures across slices. Here, we applied an in-house machine-learning algorithm to assess breast density on normal breasts in three MRI datasets. Breast density was consistent across different datasets (0.104 - 0.114). Analysis across different age groups also demonstrated strong consistency across datasets and confirmed a trend of decreasing density with age as reported in previous studies. MR breast density was correlated with mammographic breast density, although some notable differences suggest that certain breast density components are captured only on MRI. Future work will determine how to integrate MR breast density with current tools to improve future breast cancer risk prediction.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Hardware-based Heterogeneous Memory Management for Large Language Model Inference
Authors:
Soojin Hwang,
Jungwoo Kim,
Sanghyeon Lee,
Hongbeen Kim,
Jaehyuk Huh
Abstract:
A large language model (LLM) is one of the most important emerging machine learning applications nowadays. However, due to its huge model size and runtime increase of the memory footprint, LLM inferences suffer from the lack of memory capacity in conventional systems consisting of multiple GPUs with a modest amount of high bandwidth memory. Moreover, since LLM contains many bandwidthintensive kern…
▽ More
A large language model (LLM) is one of the most important emerging machine learning applications nowadays. However, due to its huge model size and runtime increase of the memory footprint, LLM inferences suffer from the lack of memory capacity in conventional systems consisting of multiple GPUs with a modest amount of high bandwidth memory. Moreover, since LLM contains many bandwidthintensive kernels, only focusing on the memory capacity without considering the bandwidth incurs a serious performance degradation. To handle such conflicting memory capacity and bandwidth demands in a cost-effective way, this study investigates the potential of heterogeneous memory systems, proposing H2M2. It uses an asymmetric memory architecture consisting of capacity-centric and bandwidthcentric memory with computation units attached to each memory device. With the asymmetric memory, we first analyze the effect of kernel-memory mapping for the asymmetric memory. Second, we propose a dynamic runtime algorithm that finds a mapping solution considering the characteristics of LLM operations and the change of footprint during LLM inference. Third, we advocate the need for memory abstraction for the efficient management of the asymmetric memory. H2M2 outperforms the conventional homogeneous memory system with LPDDR by 1.46x, 1.55x, and 2.94x speedup in GPT3-175B, Chinchilla-70B, and Llama2-70B, respectively.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation
Authors:
Minho Park,
Taewoong Kang,
Jooyeol Yun,
Sungwon Hwang,
Jaegul Choo
Abstract:
The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning…
▽ More
The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning-free methods that still rely on ERP latent representations, leading to discontinuities near the poles. In this paper, we introduce SphereDiff, a novel approach for seamless 360-degree panoramic image and video generation using state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures uniform distribution across all perspectives, mitigating the distortions inherent in ERP. We extend MultiDiffusion to spherical latent space and propose a spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality in the projection process. Our method outperforms existing approaches in generating 360-degree panoramic content while maintaining high fidelity, making it a robust solution for immersive AR/VR applications. The code is available here. https://github.com/pmh9960/SphereDiff
△ Less
Submitted 19 April, 2025;
originally announced April 2025.
-
BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting
Authors:
Jeongwan On,
Kyeonghwan Gwak,
Gunyoung Kang,
Junuk Cha,
Soohyun Hwang,
Hyein Hwang,
Seungryul Baek
Abstract:
Reconstructing 3Ds of hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, whe…
▽ More
Reconstructing 3Ds of hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, where object templates are pre-known or only one hand is involved in the interaction. The bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we first introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians avoiding severe occlusions, we leverage prior knowledge of pre-trained diffusion model with score distillation sampling (SDS) loss, to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of hand model (i.e., MANO) and share a single Gaussian for two hands to effectively accumulate hand 3D information, given limited views. To further consider the 3D alignment between hands and objects, we include the interacting-subjects optimization step during Gaussian optimization. Our method achieves the state-of-the-art accuracy on two challenging datasets, in terms of 3D hand pose estimation (MPJPE), 3D object reconstruction (CDh, CDo, F10), and rendering quality (PSNR, SSIM, LPIPS), respectively.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
A Mean-Reverting Model of Exchange Rate Risk Premium Using Ornstein-Uhlenbeck Dynamics
Authors:
SeungJae Hwang
Abstract:
This paper examines the empirical failure of uncovered interest parity (UIP) and proposes a structural explanation based on a mean-reverting risk premium. We define a realized premium as the deviation between observed exchange rate returns and the interest rate differential, and demonstrate its strong mean-reverting behavior across multiple horizons. Motivated by this pattern, we model the risk pr…
▽ More
This paper examines the empirical failure of uncovered interest parity (UIP) and proposes a structural explanation based on a mean-reverting risk premium. We define a realized premium as the deviation between observed exchange rate returns and the interest rate differential, and demonstrate its strong mean-reverting behavior across multiple horizons. Motivated by this pattern, we model the risk premium using an Ornstein-Uhlenbeck (OU) process embedded within a stochastic differential equation for the exchange rate.
Our model yields closed-form approximations for future exchange rate distributions, which we evaluate using coverage-based backtesting. Applied to USD/KRW data from 2010 to 2025, the model shows strong predictive performance at both short-term and long-term horizons, while underperforming at intermediate (3-month) horizons and showing conservative behavior in the tails of long-term forecasts. These results suggest that exchange rate deviations from UIP may reflect structured, forecastable dynamics rather than pure noise, and point to future modeling improvements via regime-switching or time-varying volatility.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
A Redshift Survey of the Coma Cluster (A1656): Understanding the Nature of Subhalos in the Weak-lensing Map
Authors:
Wooseok Kang,
Ho Seong Hwang,
Nobuhiro Okabe,
Changbom Park
Abstract:
We study the physical properties of weak-lensing subhalos in the Coma cluster of galaxies using data from galaxy redshift surveys. The data include 12989 galaxies with measured spectroscopic redshifts (2184 from our MMT/Hectospec observation and 10807 from the literature). The $r$-band magnitude limit at which the differential spectroscopic completeness drops below 50% is 20.2 mag, which is spatia…
▽ More
We study the physical properties of weak-lensing subhalos in the Coma cluster of galaxies using data from galaxy redshift surveys. The data include 12989 galaxies with measured spectroscopic redshifts (2184 from our MMT/Hectospec observation and 10807 from the literature). The $r$-band magnitude limit at which the differential spectroscopic completeness drops below 50% is 20.2 mag, which is spatially uniform in a region of 4.5 deg$^{2}$ where the weak-lensing map of Okabe et al. (2014) exists. We identify 1337 member galaxies in this field and use them to understand the nature of 32 subhalos detected in the weak-lensing analysis. We use Gaussian Mixture Modeling (GMM) in the line-of-sight velocity domain to measure the mean velocity, the velocity dispersion, and the number of subhalo galaxies by mitigating the contamination from the interloping galaxies. Using subhalo properties calculated with GMM, we find no significant difference in the redshift space distribution between the cluster member galaxies and subhalos. We find that the weak-lensing mass shows strong correlations with the number of subhalo member galaxies, velocity dispersion, and dynamical mass of subhalos with power-law slopes of $0.54^{+0.16}_{-0.15}$, $0.93^{+0.35}_{-0.32}$, and $0.50^{+0.31}_{-0.18}$, respectively. The slope of the mass--velocity dispersion relation of the weak-lensing subhalos appears shallower than that of the galaxy clusters, galaxy groups, and individual galaxies. These results suggest that the combination of redshift surveys with weak-lensing maps can be a powerful tool for better understanding the nature of subhalos in clusters.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Instruction-Guided Autoregressive Neural Network Parameter Generation
Authors:
Soro Bedionita,
Bruno Andreis,
Song Chong,
Sung Ju Hwang
Abstract:
Learning to generate neural network parameters conditioned on task descriptions and architecture specifications is pivotal for advancing model adaptability and transfer learning. Existing methods especially those based on diffusion models suffer from limited scalability to large architectures, rigidity in handling varying network depths, and disjointed parameter generation that undermines inter-la…
▽ More
Learning to generate neural network parameters conditioned on task descriptions and architecture specifications is pivotal for advancing model adaptability and transfer learning. Existing methods especially those based on diffusion models suffer from limited scalability to large architectures, rigidity in handling varying network depths, and disjointed parameter generation that undermines inter-layer coherence. In this work, we propose IGPG (Instruction Guided Parameter Generation), an autoregressive framework that unifies parameter synthesis across diverse tasks and architectures. IGPG leverages a VQ-VAE and an autoregressive model to generate neural network parameters, conditioned on task instructions, dataset, and architecture details. By autoregressively generating neural network weights' tokens, IGPG ensures inter-layer coherence and enables efficient adaptation across models and datasets. Operating at the token level, IGPG effectively captures complex parameter distributions aggregated from a broad spectrum of pretrained models. Extensive experiments on multiple vision datasets demonstrate that IGPG consolidates diverse pretrained models into a single, flexible generative framework. The synthesized parameters achieve competitive or superior performance relative to state-of-the-art methods, especially in terms of scalability and efficiency when applied to large architectures. These results underscore ICPG potential as a powerful tool for pretrained weight retrieval, model selection, and rapid task-specific fine-tuning.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis
Authors:
Woojung Han,
Yeonkyung Lee,
Chanyoung Kim,
Kwanghyun Park,
Seong Jae Hwang
Abstract:
Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while the existing methods have been continuously focusing on several challenges, such as "missing objects" and "mismatched attributes," another critical issue of "mislocate…
▽ More
Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while the existing methods have been continuously focusing on several challenges, such as "missing objects" and "mismatched attributes," another critical issue of "mislocated objects" remains where generated spatial positions fail to align with text prompts. Surprisingly, ensuring such seemingly basic functionality remains challenging in popular T2I models due to the inherent difficulty of imposing explicit spatial guidance via text forms. To address this, we propose STORM (Spatial Transport Optimization by Repositioning Attention Map), a novel training-free approach for spatially coherent T2I synthesis. STORM employs Spatial Transport Optimization (STO), rooted in optimal transport theory, to dynamically adjust object attention maps for precise spatial adherence, supported by a Spatial Transport (ST) Cost function that enhances spatial understanding. Our analysis shows that integrating spatial awareness is most effective in the early denoising stages, while later phases refine details. Extensive experiments demonstrate that STORM surpasses existing methods, effectively mitigating mislocated objects while improving missing and mismatched attributes, setting a new benchmark for spatial alignment in T2I synthesis.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning
Authors:
Seong-Hyeon Hwang,
Minsu Kim,
Steven Euijong Whang
Abstract:
We study model confidence calibration in class-incremental learning, where models learn from sequential tasks with different class sets. While existing works primarily focus on accuracy, maintaining calibrated confidence has been largely overlooked. Unfortunately, most post-hoc calibration techniques are not designed to work with the limited memories of old-task data typical in class-incremental l…
▽ More
We study model confidence calibration in class-incremental learning, where models learn from sequential tasks with different class sets. While existing works primarily focus on accuracy, maintaining calibrated confidence has been largely overlooked. Unfortunately, most post-hoc calibration techniques are not designed to work with the limited memories of old-task data typical in class-incremental learning, as retaining a sufficient validation set would be impractical. Thus, we propose T-CIL, a novel temperature scaling approach for class-incremental learning without a validation set for old tasks, that leverages adversarially perturbed exemplars from memory. Directly using exemplars is inadequate for temperature optimization, since they are already used for training. The key idea of T-CIL is to perturb exemplars more strongly for old tasks than for the new task by adjusting the perturbation direction based on feature distance, with the single magnitude determined using the new-task validation set. This strategy makes the perturbation magnitude computed from the new task also applicable to old tasks, leveraging the tendency that the accuracy of old tasks is lower than that of the new task. We empirically show that T-CIL significantly outperforms various baselines in terms of calibration on real datasets and can be integrated with existing class-incremental learning techniques with minimal impact on accuracy.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment
Authors:
Ghazanfar Ali,
Hong-Quan Le,
Junho Kim,
Seoung-won Hwang,
Jae-In Hwang
Abstract:
In this paper, we present the design of a multimodal interaction framework for intelligent virtual agents in wearable mixed reality environments, especially for interactive applications at museums, botanical gardens, and similar places. These places need engaging and no-repetitive digital content delivery to maximize user involvement. An intelligent virtual agent is a promising mode for both purpo…
▽ More
In this paper, we present the design of a multimodal interaction framework for intelligent virtual agents in wearable mixed reality environments, especially for interactive applications at museums, botanical gardens, and similar places. These places need engaging and no-repetitive digital content delivery to maximize user involvement. An intelligent virtual agent is a promising mode for both purposes. Premises of framework is wearable mixed reality provided by MR devices supporting spatial mapping. We envisioned a seamless interaction framework by integrating potential features of spatial mapping, virtual character animations, speech recognition, gazing, domain-specific chatbot and object recognition to enhance virtual experiences and communication between users and virtual agents. By applying a modular approach and deploying computationally intensive modules on cloud-platform, we achieved a seamless virtual experience in a device with limited resources. Human-like gaze and speech interaction with a virtual agent made it more interactive. Automated mapping of body animations with the content of a speech made it more engaging. In our tests, the virtual agents responded within 2-4 seconds after the user query. The strength of the framework is flexibility and adaptability. It can be adapted to any wearable MR device supporting spatial mapping.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
Authors:
Jeonghyeon Kim,
Sangheum Hwang
Abstract:
Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only…
▽ More
Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration
Authors:
Taejin Jeong,
Joohyeok Kim,
Jaehoon Joo,
Yeonwoo Jung,
Hyeonmin Kim,
Seong Jae Hwang
Abstract:
Glaucoma is an incurable ophthalmic disease that damages the optic nerve, leads to vision loss, and ranks among the leading causes of blindness worldwide. Diagnosing glaucoma typically involves fundus photography, optical coherence tomography (OCT), and visual field testing. However, the high cost of OCT often leads to reliance on fundus photography and visual field testing, both of which exhibit…
▽ More
Glaucoma is an incurable ophthalmic disease that damages the optic nerve, leads to vision loss, and ranks among the leading causes of blindness worldwide. Diagnosing glaucoma typically involves fundus photography, optical coherence tomography (OCT), and visual field testing. However, the high cost of OCT often leads to reliance on fundus photography and visual field testing, both of which exhibit inherent inter-observer variability. This stems from glaucoma being a multifaceted disease that influenced by various factors. As a result, glaucoma diagnosis is highly subjective, emphasizing the necessity of calibration, which aligns predicted probabilities with actual disease likelihood. Proper calibration is essential to prevent overdiagnosis or misdiagnosis, which are critical concerns for high-risk diseases. Although AI has significantly improved diagnostic accuracy, overconfidence in models have worsen calibration performance. Recent study has begun focusing on calibration for glaucoma. Nevertheless, previous study has not fully considered glaucoma's systemic nature and the high subjectivity in its diagnostic process. To overcome these limitations, we propose V-ViT (Voting-based ViT), a novel framework that enhances calibration by incorporating disease-specific characteristics. V-ViT integrates binocular data and metadata, reflecting the multi-faceted nature of glaucoma diagnosis. Additionally, we introduce a MC dropout-based Voting System to address high subjectivity. Our approach achieves state-of-the-art performance across all metrics, including accuracy, demonstrating that our proposed methods are effective in addressing calibration issues. We validate our method using a custom dataset including binocular data.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.