Showing 1–50 of 483 results for author: hwang, s

  1. arXiv:2507.07990  [pdf, ps, other]

    cs.CV cs.AI

    Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

    Authors: Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim

    Abstract: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in pri…

    Submitted 10 July, 2025; originally announced July 2025.

    Comments: Accepted at ICCV2025; Project page: https://www.jshyun.me/projects/sttm

  2. arXiv:2507.07955  [pdf, ps, other]

    cs.LG

    Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

    Authors: Sukjun Hwang, Brandon Wang, Albert Gu

    Abstract: Despite incredible progress in language models (LMs) in recent years, largely resulting from moving away from specialized models designed for specific tasks to general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new tec…

    Submitted 10 July, 2025; originally announced July 2025.

  3. arXiv:2507.02687  [pdf, ps, other]

    cs.CV cs.AI

    APT: Adaptive Personalized Training for Diffusion Models with Limited Data

    Authors: JungWoo Chae, Jiyoon Kim, JaeWoong Choi, Kyungyul Kim, Sangheum Hwang

    Abstract: Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: CVPR 2025 camera ready. Project page: https://lgcnsai.github.io/apt

    MSC Class: 60J60; 68T07 ACM Class: I.2.6; I.2.10; I.4.9

    Journal ref: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 28619-28628

  4. arXiv:2506.23547  [pdf, ps, other]

    cs.CV

    Oneta: Multi-Style Image Enhancement Using Eigentransformation Functions

    Authors: Jiwon Kim, Soohyun Hwang, Dong-O Kim, Changsu Han, Min Kyu Park, Chang-Su Kim

    Abstract: The first algorithm, called Oneta, for a novel task of multi-style image enhancement is proposed in this work. Oneta uses two point operators sequentially: intensity enhancement with a transformation function (TF) and color correction with a color correction matrix (CCM). This two-step enhancement model, though simple, achieves a high performance upper bound. Also, we introduce eigentransformation…

    Submitted 30 June, 2025; originally announced June 2025.

  5. arXiv:2506.23518  [pdf, ps, other]

    cs.CV

    WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

    Authors: Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang

    Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lac…

    Submitted 30 June, 2025; originally announced June 2025.

  6. arXiv:2506.15596  [pdf, ps, other]

    cs.CV

    Mono-Modalizing Extremely Heterogeneous Multi-Modal Medical Image Registration

    Authors: Kyobin Choo, Hyunkyung Han, Jinyeong Kim, Chanyong Yoon, Seong Jae Hwang

    Abstract: In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to stand…

    Submitted 30 June, 2025; v1 submitted 18 June, 2025; originally announced June 2025.

    Comments: 11 pages, 3 figures, 2 tables, Accepted at Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025

    ACM Class: I.4.5; I.4.9; J.3

  7. arXiv:2506.14213  [pdf, ps, other]

    cs.CL

    Chaining Event Spans for Temporal Relation Grounding

    Authors: Jongho Kim, Dohyeon Lee, Minsoo Kim, Seung-won Hwang

    Abstract: Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). For example, in TRC, we need to understand the temporal semantic differences between the following two questions that are lexically near-identical: "What finished right before the decision?" or "What finished right afte…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1689-1700

  8. arXiv:2506.14203  [pdf, ps, other]

    cs.CL

    Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation

    Authors: Jongho Kim, Romain Storaï, Seung-won Hwang

    Abstract: In this study, we investigate the potential of language models (LMs) in aiding patients experiencing anomia, a difficulty identifying the names of items. Identifying the intended target item from a patient's circumlocution involves the two challenges of term failure and error: (1) The terms relevant to identifying the item remain unseen. (2) What makes the challenge unique is inherent perturbed term…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: EMNLP 2024 Findings (long)

    Journal ref: In Findings of the Association for Computational Linguistics, EMNLP 2024, pages 10513-10527

  9. arXiv:2506.11877  [pdf, ps, other]

    cs.LG cs.AI

    Robust Molecular Property Prediction via Densifying Scarce Labeled Data

    Authors: Jina Kim, Jeffrey Willette, Bruno Andreis, Sung Ju Hwang

    Abstract: A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substan…

    Submitted 7 July, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

  10. arXiv:2506.09229  [pdf, ps, other]

    cs.CV

    Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

    Authors: Sungwon Hwang, Hyojin Jang, Kinam Kim, Minho Park, Jaegul Choo

    Abstract: Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimila…

    Submitted 25 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: Project page: https://crepavideo.github.io

  11. arXiv:2506.07177  [pdf, ps, other]

    cs.CV cs.AI

    Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

    Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang

    Abstract: Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation b…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: Project page: https://frame-guidance-video.github.io/

  12. arXiv:2506.05522  [pdf, other]

    cs.SI cs.HC

    Understanding Community-Level Blocklists in Decentralized Social Media

    Authors: Owen Xingjian Zhang, Sohyeon Hwang, Yuhan Liu, Manoel Horta Ribeiro, Andrés Monroy-Hernández

    Abstract: Community-level blocklists are key to content moderation practices in decentralized social media. These blocklists enable moderators to prevent other communities, such as those acting in bad faith, from interacting with their own -- and, if shared publicly, warn others about communities worth blocking. Prior work has examined blocklists in centralized social media, noting their potential for colle…

    Submitted 5 June, 2025; originally announced June 2025.

  13. arXiv:2506.05167  [pdf, ps, other]

    cs.CL cs.AI cs.IR

    ECoRAG: Evidentiality-guided Compression for Long Context RAG

    Authors: Yeonseok Jeong, Jinsu Kim, Dohyeon Lee, Seung-won Hwang

    Abstract: Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the RAG overhead from longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits the performance in LLM-based RAG.…

    Submitted 6 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

  14. arXiv:2506.04704  [pdf, ps, other]

    cs.CV cs.AI

    HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

    Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang

    Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in…

    Submitted 11 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: Project page: https://youngwanlee.github.io/holisafe

  15. arXiv:2506.04288  [pdf, ps, other]

    cs.LG

    Backbone Augmented Training for Adaptations

    Authors: Jae Wan Park, Junhyeok Kim, Youngjun Jun, Hyunah Ko, Seong Jae Hwang

    Abstract: Adaptations facilitate efficient training of large backbone models, including diffusion models for image generation and transformer-based language models. While various adaptation techniques enhance performance with minimal computational resources, limited adaptation data often leads to challenges in training. To address this, we focus on the enormous amount of backbone data used to pre-train the…

    Submitted 4 June, 2025; originally announced June 2025.

  16. arXiv:2506.00910  [pdf, other]

    cs.LG cs.AI

    PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models

    Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Dongseop Kim, Sung Ju Hwang

    Abstract: Knowledge distillation (KD) is a widely used framework for training compact, task-specific models by leveraging the knowledge of teacher models. However, its application to active learning (AL), which aims to minimize annotation costs through iterative sample selection, remains underexplored. This gap stems from the fact that KD typically assumes access to sufficient labeled data, whereas AL opera…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 35 pages, 30 figures

  17. arXiv:2506.00607  [pdf, ps, other]

    cs.CV cs.AI

    Parallel Rescaling: Rebalancing Consistency Guidance for Personalized Diffusion Models

    Authors: JungWoo Chae, Jiyoon Kim, Sangheum Hwang

    Abstract: Personalizing diffusion models to specific users or concepts remains challenging, particularly when only a few reference images are available. Existing methods such as DreamBooth and Textual Inversion often overfit to limited data, causing misalignment between generated images and text prompts when attempting to balance identity fidelity with prompt adherence. While Direct Consistency Optimization…

    Submitted 31 May, 2025; originally announced June 2025.

  18. arXiv:2506.00232  [pdf, ps, other]

    cs.CL

    ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering

    Authors: Ruofan Wu, Youngwon Lee, Fan Shu, Danmei Xu, Seung-won Hwang, Zhewei Yao, Yuxiong He, Feng Yan

    Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi-hop question answering. We introduce ComposeRAG, a novel modular abstracti…

    Submitted 30 May, 2025; originally announced June 2025.

  19. arXiv:2505.24553  [pdf, ps, other]

    cs.CL cs.AI

    CREFT: Sequential Multi-Agent LLM for Character Relation Extraction

    Authors: Ye Eun Chun, Taeyoon Hwang, Seung-won Hwang, Byung-Hak Kim

    Abstract: Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long-form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowled…

    Submitted 30 May, 2025; originally announced May 2025.

  20. arXiv:2505.23059  [pdf, ps, other]

    cs.IR cs.AI

    From Token to Action: State Machine Reasoning to Mitigate Overthinking in Information Retrieval

    Authors: Dohyeon Lee, Yeonseok Jeong, Seung-won Hwang

    Abstract: Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning t…

    Submitted 29 May, 2025; originally announced May 2025.

  21. arXiv:2505.23032  [pdf, ps, other]

    cs.LG cs.AI

    Bayesian Neural Scaling Law Extrapolation with Prior-Data Fitted Networks

    Authors: Dongwoo Lee, Dong Bok Lee, Steven Adriaensen, Juho Lee, Sung Ju Hwang, Frank Hutter, Seon Joo Kim, Hae Beom Lee

    Abstract: Scaling has been a major driver of recent advancements in deep learning. Numerous empirical studies have found that scaling laws often follow a power law and have proposed several variants of power-law functions to predict the scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications invol…

    Submitted 15 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted to ICML 2025

  22. arXiv:2505.22962  [pdf, ps, other]

    cs.SI cs.CY cs.HC

    Seeing the Politics of Decentralized Social Media Protocols

    Authors: Tolulope Oshinowo, Sohyeon Hwang, Amy X. Zhang, Andrés Monroy-Hernández

    Abstract: Calls to decentralize feed-based social media have been driven by concerns about the concentrated power of centralized platforms and their societal impact. In response, numerous decentralized social media protocols have emerged, each interpreting "decentralization" in different ways. We analyze four such protocols -- ActivityPub, AT Protocol, Nostr, and Farcaster -- to develop a novel conceptual f…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 22 pages, 6 figures, 3 tables

  23. arXiv:2505.19764  [pdf, ps, other]

    cs.LG cs.AI

    Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding

    Authors: Patara Trirat, Wonyong Jeong, Sung Ju Hwang

    Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Code will be available at https://github.com/DeepAuto-AI/agentic-predictor

  24. arXiv:2505.18512  [pdf, ps, other]

    cs.IR cs.AI cs.CL cs.LG

    AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

    Authors: Soyoung Yoon, Gyuwan Kim, Gyu-Hwung Cho, Seung-won Hwang

    Abstract: Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, lea…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: 22 pages, 3 figures. The first two authors contributed equally. Author order is randomly determined via coin toss

  25. arXiv:2505.18318  [pdf, ps, other]

    cs.HC

    The Relational Origins of Rules in Online Communities

    Authors: Charles Kiene, Sohyeon Hwang, Nathan TeBlunthuis, Carl Colglazier, Aaron Shaw, Benjamin Mako Hill

    Abstract: Where do rules come from in online communities? While prior studies of online community governance in social computing have sought to characterize rules by their functions within communities and documented practices of rule enforcement, they have largely overlooked rule adoption and change. This study investigates how and why online communities adopt and change their rules. We conducted a grounded…

    Submitted 23 May, 2025; originally announced May 2025.

  26. arXiv:2505.17612  [pdf, other]

    cs.CL cs.AI

    Distilling LLM Agent into Small Models with Retrieval and Code Tools

    Authors: Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang

    Abstract: Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise co…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: preprint, v1

  27. arXiv:2505.16631  [pdf, other]

    cs.IR cs.CL

    MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries

    Authors: Jonghwi Kim, Deokhyung Kang, Seonjeong Hwang, Yunsu Kim, Jungseul Ok, Gary Lee

    Abstract: Despite bilingual speakers frequently using mixed-language queries in web searches, Information Retrieval (IR) research on them remains scarce. To address this, we introduce MiLQ, the Mixed-Language Query test set, the first public benchmark of mixed-language queries, confirmed as realistic and highly preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: 16 pages, 9 figures

  28. arXiv:2505.16223  [pdf, ps, other]

    cs.AI cs.LG

    MADCluster: Model-agnostic Anomaly Detection with Self-supervised Clustering Network

    Authors: Sangyong Lee, Subo Hwang, Dohoon Kim

    Abstract: In this paper, we propose MADCluster, a novel model-agnostic anomaly detection framework utilizing self-supervised clustering. MADCluster is applicable to various deep learning architectures and addresses the 'hypersphere collapse' problem inherent in existing deep learning-based anomaly detection methods. The core idea is to cluster normal pattern data into a 'single cluster' while simultaneously…

    Submitted 11 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: 24 pages, 9 figures

  29. arXiv:2505.15137  [pdf, ps, other]

    cs.CV

    Multispectral Detection Transformer with Infrared-Centric Sensor Fusion

    Authors: Seongmin Hwang, Daeyoung Han, Moongu Jeon

    Abstract: Multispectral object detection aims to leverage complementary information from visible (RGB) and infrared (IR) modalities to enable robust performance under diverse environmental conditions. In this letter, we propose IC-Fusion, a multispectral object detector that effectively fuses visible and infrared features through a lightweight and modality-aware design. Motivated by wavelet analysis and empi…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Under Review

  30. arXiv:2505.12805  [pdf, other]

    cs.LG cs.AI

    FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA

    Authors: Seanie Lee, Sangwoo Park, Dong Bok Lee, Dominik Wagner, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang

    Abstract: Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in federated learning (FL). However, when combined with differentially private stochastic gradient descent (DP-SGD), LoRA faces substantial noise amplification: DP-SGD perturbs per-sample gradients, and the matrix mul…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: preprint

  31. arXiv:2505.12780  [pdf, other]

    cs.HC

    Beyond Individual UX: Defining Group Experience(GX) as a New Paradigm for Group-centered AI

    Authors: Soohwan Lee, Seoyeong Hwang, Kyungho Lee

    Abstract: Recent advancements in HCI and AI have predominantly centered on individual user experiences, often neglecting the emergent dynamics of group interactions. This provocation introduces Group Experience (GX) to capture the collective perceptual, emotional, and cognitive dimensions that arise when individuals interact in cohesive groups. We challenge the conventional Human-centered AI paradigm and pro…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted at DIS'25 Companion (Provocations)

  32. arXiv:2505.12233  [pdf, ps, other]

    eess.IV cs.CV

    PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning

    Authors: Yeonkyung Lee, Woojung Han, Youngjun Jun, Hyeonmin Kim, Jungkyung Cho, Seong Jae Hwang

    Abstract: Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely availab…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: MICCAI2025 early accept

  33. arXiv:2505.11254  [pdf, ps, other]

    cs.LG

    Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

    Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang

    Abstract: The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance…

    Submitted 16 May, 2025; originally announced May 2025.

  34. arXiv:2505.09666  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    System Prompt Optimization with Meta-Learning

    Authors: Yumin Choi, Jinheon Baek, Sung Ju Hwang

    Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the sy…

    Submitted 14 May, 2025; originally announced May 2025.

  35. arXiv:2505.08528  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    GradMix: Gradient-based Selective Mixup for Robust Data Augmentation in Class-Incremental Learning

    Authors: Minsu Kim, Seong-Hyeon Hwang, Steven Euijong Whang

    Abstract: In the context of continual learning, acquiring new knowledge while maintaining previous knowledge presents a significant challenge. Existing methods often use experience replay techniques that store a small portion of previous task data for training. In experience replay approaches, data augmentation has emerged as a promising strategy to further improve the model performance by mixing limited pr…

    Submitted 13 May, 2025; originally announced May 2025.

  36. arXiv:2505.07675  [pdf, other]

    cs.LG cs.AI cs.CV

    Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization

    Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang

    Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-s…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 41 pages, 19 figures, preprint

  37. arXiv:2505.06951  [pdf, ps, other]

    cs.CV cs.RO

    Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation

    Authors: Seokjun Kwon, Jeongmin Shin, Namil Kim, Soonmin Hwang, Yukyung Choi

    Abstract: In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effect…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: 7 pages, 4 figures, International Conference on Robotics and Automation (ICRA) 2025

  38. arXiv:2504.21850  [pdf, other]

    cs.CV

    COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

    Authors: Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky

    Abstract: Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be partially the result of the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditional…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: 17 pages, 13 figures

  39. arXiv:2504.20734  [pdf, other]

    cs.CL cs.AI cs.CV cs.IR cs.LG

    UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

    Authors: Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang

    Abstract: Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In…

    Submitted 19 May, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

    Comments: Project page: https://universalrag.github.io

  40. arXiv:2504.19574  [pdf, other]

    cs.CV

    DG-DETR: Toward Domain Generalized Detection Transformer

    Authors: Seongmin Hwang, Daeyoung Han, Moongu Jeon

    Abstract: End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-an…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: Under Review

  41. arXiv:2504.18157  [pdf, other]

    eess.AS cs.SD

    DOSE : Drum One-Shot Extraction from Music Mixture

    Authors: Suntae Hwang, Seonghyeon Kang, Kyungsu Kim, Semin Ahn, Kyogu Lee

    Abstract: Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction, a task in which the goal is to extract drum one-shots that are present in the music mixture. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with correspo…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: Published in IEEE ICASSP 2025

  42. arXiv:2504.17219  [pdf, other]

    cs.LG cs.AI cs.CR

    Enhancing Variational Autoencoders with Smooth Robust Latent Encoding

    Authors: Hyomin Lee, Minseon Kim, Sangwon Jang, Jongheon Jeong, Sung Ju Hwang

    Abstract: Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models, as in Stable Diffusion, yet questions regarding their robustness remain largely underexplored. Although adversarial training has been an established technique for enhancing robustness in predictive models, it has been overlooked for generative models due to concerns about potential fidelity degr…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: Under review

  43. arXiv:2504.17192  [pdf, other]

    cs.CL

    Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

    Authors: Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang

    Abstract: Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent…

    Submitted 18 May, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

  44. arXiv:2504.15192  [pdf]

    cs.CV cs.AI

    Breast density in MRI: an AI-based quantification and relationship to assessment in mammography

    Authors: Yaqian Chen, Lin Li, Hanxue Gu, Haoyu Dong, Derek L. Nguyen, Allan D. Kirk, Maciej A. Mazurowski, E. Shelley Hwang

    Abstract: Mammographic breast density is a well-established risk factor for breast cancer. Recently there has been interest in breast MRI as an adjunct to mammography, as this modality provides an orthogonal and highly quantitative assessment of breast tissue. However, its 3D nature poses analytic challenges related to delineating and aggregating complex structures across slices. Here, we applied an in-hous…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 13 pages, 5 figures

  45. arXiv:2504.14893  [pdf, other]

    cs.AR

    Hardware-based Heterogeneous Memory Management for Large Language Model Inference

    Authors: Soojin Hwang, Jungwoo Kim, Sanghyeon Lee, Hongbeen Kim, Jaehyuk Huh

    Abstract: A large language model (LLM) is one of the most important emerging machine learning applications nowadays. However, due to its huge model size and runtime increase of the memory footprint, LLM inferences suffer from the lack of memory capacity in conventional systems consisting of multiple GPUs with a modest amount of high bandwidth memory. Moreover, since LLM contains many bandwidth-intensive kern…

    Submitted 21 April, 2025; originally announced April 2025.

  46. arXiv:2504.14396  [pdf, other]

    cs.CV

    SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation

    Authors: Minho Park, Taewoong Kang, Jooyeol Yun, Sungwon Hwang, Jaegul Choo

    Abstract: The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning…

    Submitted 19 April, 2025; originally announced April 2025.

  47. arXiv:2504.09097  [pdf, other]

    cs.CV

    BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

    Authors: Jeongwan On, Kyeonghwan Gwak, Gunyoung Kang, Junuk Cha, Soohyun Hwang, Hyein Hwang, Seungryul Baek

    Abstract: Reconstructing 3Ds of hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, whe…

    Submitted 12 April, 2025; originally announced April 2025.

    Comments: Accepted to CVPR 2025

  48. arXiv:2504.02012  [pdf, other]

    cs.LG

    Instruction-Guided Autoregressive Neural Network Parameter Generation

    Authors: Soro Bedionita, Bruno Andreis, Song Chong, Sung Ju Hwang

    Abstract: Learning to generate neural network parameters conditioned on task descriptions and architecture specifications is pivotal for advancing model adaptability and transfer learning. Existing methods, especially those based on diffusion models, suffer from limited scalability to large architectures, rigidity in handling varying network depths, and disjointed parameter generation that undermines inter-la…

    Submitted 2 April, 2025; originally announced April 2025.

  49. arXiv:2503.22168  [pdf, other]

    cs.CV

    Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis

    Authors: Woojung Han, Yeonkyung Lee, Chanyoung Kim, Kwanghyun Park, Seong Jae Hwang

    Abstract: Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while the existing methods have been continuously focusing on several challenges, such as "missing objects" and "mismatched attributes," another critical issue of "mislocate…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: CVPR2025

  50. arXiv:2503.22163  [pdf, other]

    cs.LG

    T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning

    Authors: Seong-Hyeon Hwang, Minsu Kim, Steven Euijong Whang

    Abstract: We study model confidence calibration in class-incremental learning, where models learn from sequential tasks with different class sets. While existing works primarily focus on accuracy, maintaining calibrated confidence has been largely overlooked. Unfortunately, most post-hoc calibration techniques are not designed to work with the limited memories of old-task data typical in class-incremental l…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025