-
GaRA-SAM: Robustifying Segment Anything Model with Gated-Rank Adaptation
Authors:
Sohyun Lee,
Yeho Gwon,
Lukas Hoyer,
Suha Kwak
Abstract:
Improving robustness of the Segment Anything Model (SAM) to input degradations is critical for its deployment in high-stakes applications such as autonomous driving and robotics. Our approach to this challenge prioritizes three key aspects: first, parameter efficiency to maintain the inherent generalization capability of SAM; second, fine-grained and input-aware robustification to precisely address the input corruption; and third, adherence to standard training protocols for ease of training. To this end, we propose gated-rank adaptation (GaRA). GaRA introduces lightweight adapters into intermediate layers of the frozen SAM, where each adapter dynamically adjusts the effective rank of its weight matrix based on the input by selectively activating (rank-1) components of the matrix using a learned gating module. This adjustment enables fine-grained and input-aware robustification without compromising the generalization capability of SAM. Our model, GaRA-SAM, significantly outperforms prior work on all robust segmentation benchmarks. In particular, it surpasses the previous best IoU score by up to 21.3%p on ACDC, a challenging real corrupted image dataset.
Submitted 3 June, 2025;
originally announced June 2025.
-
StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment
Authors:
Younghyun Kim,
Jongheon Jeong,
Sangkyung Kwak,
Kyungmin Lee,
Juho Lee,
Jinwoo Shin
Abstract:
Learning robust representations from data often requires scale, which has led to the success of recent zero-shot models such as CLIP. However, the obtained robustness can easily be deteriorated when these models are fine-tuned on other downstream tasks (e.g., of smaller scales). Previous works often interpret this phenomenon in the context of domain shift, developing fine-tuning methods that aim to preserve the original domain as much as possible. However, in a different context, fine-tuned models with limited data are also prone to learning features that are spurious to humans, such as background or texture. In this paper, we propose StarFT (Spurious Textual Alignment Regularization), a novel framework for fine-tuning zero-shot models to enhance robustness by preventing them from learning spuriosity. We introduce a regularization that aligns the output distribution for spuriosity-injected labels with the original zero-shot model, ensuring that the model is not induced to extract irrelevant features further from these descriptions. We leverage recent language models to get such spuriosity-injected labels by generating alternative textual descriptions that highlight potentially confounding features. Extensive experiments validate the robust generalization of StarFT and its emerging properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts both worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario, where other robust fine-tuning baselines show even degraded performance.
Submitted 20 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
Improving Sound Source Localization with Joint Slot Attention on Image and Audio
Authors:
Inho Kim,
Youngkil Song,
Jicheol Park,
Won Hwa Kim,
Suha Kwak
Abstract:
Sound source localization (SSL) is the task of locating the source of sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and audio as a single embedding vector each, and use them to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal due to the presence of noise and background irrelevant to the actual target in the input. We present a novel SSL method that addresses this chronic issue by joint slot attention on image and audio. To be specific, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only target representations of image and audio are used for contrastive learning. Also, we introduce cross-modal attention matching to further align local features of image and audio. Our method achieved the best results in almost all settings on three public benchmarks for SSL, and substantially outperformed all prior work in cross-modal retrieval.
Submitted 11 May, 2025; v1 submitted 21 April, 2025;
originally announced April 2025.
-
TestDG: Test-time Domain Generalization for Continual Test-time Adaptation
Authors:
Sohyun Lee,
Nayeong Kim,
Juwon Kang,
Seong Joon Oh,
Suha Kwak
Abstract:
This paper studies continual test-time adaptation (CTTA), the task of adapting a model to constantly changing unseen domains in testing while preserving previously learned knowledge. Existing CTTA methods mostly focus on adaptation to the current test domain only, overlooking generalization to arbitrary test domains a model may face in the future. To tackle this limitation, we present a novel online test-time domain generalization framework for CTTA, dubbed TestDG. TestDG aims to learn features invariant to both current and previous test domains on the fly during testing, improving the potential for effective generalization to future domains. To this end, we propose a new model architecture and a test-time adaptation strategy dedicated to learning domain-invariant features, along with a new data structure and optimization algorithm for effectively managing information from previous test domains. TestDG achieved state of the art on four public CTTA benchmarks. Moreover, it showed superior generalization to unseen test domains.
Submitted 3 June, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
Authors:
Boseung Jeong,
Jicheol Park,
Sungyeon Kim,
Suha Kwak
Abstract:
Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
Submitted 3 April, 2025;
originally announced April 2025.
-
GENIUS: A Generative Framework for Universal Multimodal Search
Authors:
Sungyeon Kim,
Xinliang Zhu,
Xiaofan Lin,
Muhammet Bastan,
Douglas Gray,
Suha Kwak
Abstract:
Generative retrieval is an emerging approach in information retrieval that generates identifiers (IDs) of target data based on a query, providing an efficient alternative to traditional embedding-based retrieval methods. However, existing models are task-specific and fall short of embedding-based retrieval in performance. This paper proposes GENIUS, a universal generative retrieval framework supporting diverse tasks across multiple modalities and domains. At its core, GENIUS introduces modality-decoupled semantic quantization, transforming multimodal data into discrete IDs encoding both modality and semantics. Moreover, to enhance generalization, we propose a query augmentation that interpolates between a query and its target, allowing GENIUS to adapt to varied query forms. Evaluated on the M-BEIR benchmark, it surpasses prior generative methods by a clear margin. Unlike embedding-based retrieval, GENIUS consistently maintains high retrieval speed across database size, with competitive performance across multiple benchmarks. With additional re-ranking, GENIUS often achieves results close to those of embedding-based methods while preserving efficiency.
Submitted 25 March, 2025;
originally announced March 2025.
-
Enhancing Cost Efficiency in Active Learning with Candidate Set Query
Authors:
Yeho Gwon,
Sehyun Hwang,
Hoyoung Kim,
Jungseul Ok,
Suha Kwak
Abstract:
This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 42% on ImageNet64x64.
Submitted 10 February, 2025;
originally announced February 2025.
-
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Authors:
Dongwon Kim,
Ju He,
Qihang Yu,
Chenglin Yang,
Xiaohui Shen,
Suha Kwak,
Liang-Chieh Chen
Abstract:
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
Submitted 13 January, 2025;
originally announced January 2025.
-
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
Authors:
Sangwoon Kwak,
Joonsoo Kim,
Jun Young Jeong,
Won-Sik Cheong,
Jihyong Oh,
Munchurl Kim
Abstract:
3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts focused on adapting it for dynamic scenes. Despite delivering remarkable rendering quality and speed, existing methods struggle with storage demands and representing complex real-world motions. To tackle these issues, we propose MoDec-GS, a memory-efficient Gaussian splatting framework designed for reconstructing novel views in challenging scenarios with complex motions. We introduce Global-to-Local Motion Decomposition (GLMD) to effectively capture dynamic motions in a coarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), extending static Scaffold representation to dynamic video reconstruction. For Global CS, we propose Global Anchor Deformation (GAD) to efficiently represent global dynamics along complex motions, by directly deforming the implicit Scaffold attributes which are anchor position, offset, and local context features. Next, we finely adjust local motions via the Local Gaussian Deformation (LGD) of Local CS explicitly. Additionally, we introduce Temporal Interval Adjustment (TIA) to automatically control the temporal coverage of each Local CS during training, allowing MoDec-GS to find optimal interval assignments based on the specified number of temporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an average 70% reduction in model size over state-of-the-art methods for dynamic 3D Gaussians from real-world dynamic videos while maintaining or even improving rendering quality.
Submitted 24 March, 2025; v1 submitted 7 January, 2025;
originally announced January 2025.
-
Improving Text-based Person Search via Part-level Cross-modal Correspondence
Authors:
Jicheol Park,
Boseung Jeong,
Dongwon Kim,
Suha Kwak
Abstract:
Text-based person search is the task of finding person images that are the most relevant to the natural language text description given as a query. The main challenge of this task is a large gap between the target images and text queries, which makes it difficult to establish correspondence and distinguish subtle differences across people. To address this challenge, we introduce an efficient encoder-decoder model that extracts coarse-to-fine embedding vectors which are semantically aligned across the two modalities without supervision for the alignment. There is another challenge of learning to capture fine-grained information with only person IDs as supervision, where similar body parts of different individuals are considered different due to the lack of part-level supervision. To tackle this, we propose a novel ranking loss, dubbed commonality-based margin ranking loss, which quantifies the degree of commonality of each body part and reflects it during the learning of fine-grained body part details. As a consequence, it enables our method to achieve the best results on three public benchmarks.
Submitted 31 December, 2024;
originally announced January 2025.
-
The Impact of AI Assistance on Radiology Reporting: A Pilot Study Using Simulated AI Draft Reports
Authors:
Julián N. Acosta,
Siddhant Dogra,
Subathra Adithan,
Kay Wu,
Michael Moritz,
Stephen Kwak,
Pranav Rajpurkar
Abstract:
Radiologists face increasing workload pressures amid growing imaging volumes, creating risks of burnout and delayed reporting times. While artificial intelligence (AI) based automated radiology report generation shows promise for reporting workflow optimization, evidence of its real-world impact on clinical accuracy and efficiency remains limited. This study evaluated the effect of draft reports on radiology reporting workflows by conducting a three reader multi-case study comparing standard versus AI-assisted reporting workflows. In both workflows, radiologists reviewed the cases and modified either a standard template (standard workflow) or an AI-generated draft report (AI-assisted workflow) to create the final report. For controlled evaluation, we used GPT-4 to generate simulated AI drafts and deliberately introduced 1-3 errors in half the cases to mimic real AI system performance. The AI-assisted workflow significantly reduced average reporting time from 573 to 435 seconds (p=0.003), without a statistically significant difference in clinically significant errors between workflows. These findings suggest that AI-generated drafts can meaningfully accelerate radiology reporting while maintaining diagnostic accuracy, offering a practical solution to address mounting workload challenges in clinical practice.
Submitted 16 December, 2024;
originally announced December 2024.
-
ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation
Authors:
Dayoung Gong,
Suha Kwak,
Minsu Cho
Abstract:
Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion. The key idea to unification is to train the model to effectively handle both visible and invisible parts of the sequence in an integrated manner; the visible part is for temporal segmentation, and the invisible part is for future anticipation. To this end, we introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future. Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation. ActFusion achieves the state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models in both of the two tasks with a single unified model through joint learning.
Submitted 5 December, 2024;
originally announced December 2024.
-
Controllable Human Image Generation with Personalized Multi-Garments
Authors:
Yisol Choi,
Sangkyung Kwak,
Sihyun Yu,
Hyungwon Choi,
Jinwoo Shin
Abstract:
We present BootComp, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather every single garment photograph worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in human image and extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model having two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide-applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose, face, etc.
Submitted 1 April, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
Minimax Optimal Two-Sample Testing under Local Differential Privacy
Authors:
Jongmin Mun,
Seungwoo Kwak,
Ilmun Kim
Abstract:
We explore the trade-off between privacy and statistical utility in private two-sample testing under local differential privacy (LDP) for both multinomial and continuous data. We begin by addressing the multinomial case, where we introduce private permutation tests using practical privacy mechanisms such as Laplace, discrete Laplace, and Google's RAPPOR. We then extend our multinomial approach to continuous data via binning and study its uniform separation rates under LDP over Hölder and Besov smoothness classes. The proposed tests for both discrete and continuous cases rigorously control the type I error for any finite sample size, strictly adhere to LDP constraints, and achieve minimax separation rates under LDP. The attained minimax rates reveal inherent privacy-utility trade-offs that are unavoidable in private testing. To address scenarios with unknown smoothness parameters in density testing, we propose an adaptive test based on a Bonferroni-type approach that ensures robust performance without prior knowledge of the smoothness parameters. We validate our theoretical findings with extensive numerical experiments and demonstrate the practical relevance and effectiveness of our proposed methods.
Submitted 22 November, 2024; v1 submitted 13 November, 2024;
originally announced November 2024.
-
Bootstrapping Top-down Information for Self-modulating Slot Attention
Authors:
Dongwon Kim,
Seoyeon Kim,
Suha Kwak
Abstract:
Object-centric learning (OCL) aims to learn representations of individual objects within visual scenes without manual supervision, facilitating efficient and effective visual reasoning. Traditional OCL methods primarily employ bottom-up approaches that aggregate homogeneous visual features to represent objects. However, in complex visual environments, these methods often fall short due to the heterogeneous nature of visual features within an object. To address this, we propose a novel OCL framework incorporating a top-down pathway. This pathway first bootstraps the semantics of individual objects and then modulates the model to prioritize features relevant to these semantics. By dynamically modulating the model based on its own output, our top-down pathway enhances the representational quality of objects. Our framework achieves state-of-the-art performance across multiple synthetic and real-world object-discovery benchmarks.
Submitted 7 November, 2024; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Surfaces proper homotopy equivalent to graphs and their Dehn-Nielsen-Baer maps
Authors:
Ryan Dickmann,
Hannah Hoganson,
Sanghoon Kwak
Abstract:
Motivated by the recent work of Algom-Kfir and Bestvina introducing the mapping class group of an infinite graph via proper homotopy equivalences, we give a necessary and sufficient condition for a surface to be properly homotopy equivalent to a graph. We consider second-countable orientable surfaces that are possibly infinite-type and have noncompact boundary. For surfaces proper homotopy equivalent to graphs, we explore the basic properties of the induced map between the mapping class groups of the surface and the graph. We view this induced map as the basis of a Dehn-Nielsen-Baer analog in the setting of infinite-type surfaces.
Submitted 28 October, 2024;
originally announced October 2024.
-
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Authors:
Sihyun Yu,
Sangkyung Kwak,
Huiwon Jang,
Jongheon Jeong,
Jonathan Huang,
Jinwoo Shin,
Saining Xie
Abstract:
Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5$\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.
Submitted 28 February, 2025; v1 submitted 9 October, 2024;
originally announced October 2024.
-
Nonunique Ergodicity on the Boundary of Outer space
Authors:
Mladen Bestvina,
Elizabeth Field,
Sanghoon Kwak
Abstract:
To an $\mathbb{R}$-tree in the boundary of Outer space, we associate two simplices: the simplex of projective length measures, and the simplex of projective dual currents. For both kinds of simplices, we estimate the dimension of maximal simplices for arational $\mathbb{R}$-trees in the boundary of Outer space.
Submitted 23 September, 2024;
originally announced September 2024.
-
PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery
Authors:
Jicheol Park,
Dongwon Kim,
Boseung Jeong,
Suha Kwak
Abstract:
Text-based person search, employing free-form text queries to identify individuals within a vast image collection, presents a unique challenge in aligning visual and textual representations, particularly at the human part level. Existing methods often struggle with part feature extraction and alignment due to the lack of direct part-level supervision and reliance on heuristic features. We propose a novel framework that leverages a part discovery module based on slot attention to autonomously identify and align distinctive parts across modalities, enhancing interpretability and retrieval accuracy without explicit part-level correspondence supervision. Additionally, text-based dynamic part attention adjusts the importance of each part, further improving retrieval outcomes. Our method is evaluated on three public benchmarks, significantly outperforming existing methods.
Submitted 20 September, 2024;
originally announced September 2024.
-
Improving Robustness to Multiple Spurious Correlations by Multi-Objective Optimization
Authors:
Nayeong Kim,
Juwon Kang,
Sungsoo Ahn,
Jungseul Ok,
Suha Kwak
Abstract:
We study the problem of training an unbiased and accurate model given a dataset with multiple biases. This problem is challenging since the multiple biases cause multiple undesirable shortcuts during training, and even worse, mitigating one may exacerbate the other. We propose a novel training method to tackle this challenge. Our method first groups training data so that different groups induce different shortcuts, and then optimizes a linear combination of group-wise losses while adjusting their weights dynamically to alleviate conflicts between the groups in performance; this approach, rooted in the multi-objective optimization theory, encourages to achieve the minimax Pareto solution. We also present a new benchmark with multiple biases, dubbed MultiCelebA, for evaluating debiased training methods under realistic and challenging scenarios. Our method achieved the best on three datasets with multiple biases, and also showed superior performance on conventional single-bias datasets.
Submitted 5 September, 2024;
originally announced September 2024.
-
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
Authors:
Sungyeon Kim,
Boseung Jeong,
Donghyun Kim,
Suha Kwak
Abstract:
Large-scale image-text pre-trained models enable zero-shot classification and provide consistent accuracy across various data distributions. Nonetheless, optimizing these models in downstream tasks typically requires fine-tuning, which reduces generalization to out-of-distribution (OOD) data and demands extensive computational resources. We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks while simultaneously addressing both these issues. Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and reduce storage expenses substantially. Furthermore, we propose MPM-NCE loss designed for fine-tuning on vision-language downstream tasks. It ensures precise alignment of multiple image-text pairs and discriminative feature learning. By extending the benchmark for robust fine-tuning beyond classification to include diverse tasks such as cross-modal retrieval and open vocabulary segmentation, we demonstrate the broad applicability of R-Adapter. Our extensive experiments demonstrate that R-Adapter achieves state-of-the-art performance across a diverse set of tasks, tuning only 13% of the parameters of the CLIP encoders.
Submitted 11 August, 2024;
originally announced August 2024.
-
Online Temporal Action Localization with Memory-Augmented Transformer
Authors:
Youngkil Song,
Dongkeun Kim,
Minsu Cho,
Suha Kwak
Abstract:
Online temporal action localization (On-TAL) is the task of identifying multiple action instances given a streaming video. Since existing methods take as input only a video segment of fixed size per iteration, they are limited in considering long-term context and require tuning the segment size carefully. To overcome these limitations, we propose memory-augmented transformer (MATR). MATR utilizes the memory queue that selectively preserves the past segment features, allowing to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate the start time of the action. Our method outperformed existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
Submitted 6 August, 2024;
originally announced August 2024.
-
Classification Matters: Improving Video Action Detection with Class-Specific Attention
Authors:
Jinsung Lee,
Taeoh Kim,
Inwoong Lee,
Minho Shim,
Dongyoon Wee,
Minsu Cho,
Suha Kwak
Abstract:
Video action detection (VAD) aims to detect actors and classify their actions in a video. We figure that VAD suffers more from classification rather than localization of actors. Hence, we analyze how prevailing methods form features for classification and find that they prioritize actor regions, yet often overlooking the essential contextual information necessary for accurate classification. Accordingly, we propose to reduce the bias toward actor and encourage paying attention to the context that is relevant to each action class. By assigning a class-dedicated query to each action class, our model can dynamically determine where to focus for effective classification. The proposed model demonstrates superior performance on three challenging benchmarks with significantly fewer parameters and less computation.
Submitted 11 September, 2024; v1 submitted 29 July, 2024;
originally announced July 2024.
-
FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions
Authors:
Sohyun Lee,
Namyup Kim,
Sungyeon Kim,
Suha Kwak
Abstract:
Robust semantic segmentation under adverse conditions is crucial in real-world applications. To address this challenging task in practical scenarios where labeled normal condition images are not accessible in training, we propose FREST, a novel feature restoration framework for source-free domain adaptation (SFDA) of semantic segmentation to adverse conditions. FREST alternates two steps: (1) learning the condition embedding space that only separates the condition information from the features and (2) restoring features of adverse condition images on the learned condition embedding space. By alternating these two steps, FREST gradually restores features where the effect of adverse conditions is reduced. FREST achieved a state of the art on two public benchmarks (i.e., ACDC and RobotCar) for SFDA to adverse conditions. Moreover, it shows superior generalization ability on unseen datasets.
Submitted 18 July, 2024;
originally announced July 2024.
-
Direct Preference Optimization for Suppressing Hallucinated Prior Exams in Radiology Report Generation
Authors:
Oishi Banerjee,
Hong-Yu Zhou,
Subathra Adithan,
Stephen Kwak,
Kay Wu,
Pranav Rajpurkar
Abstract:
Recent advances in generative vision-language models (VLMs) have exciting potential implications for AI in radiology, yet VLMs are also known to produce hallucinations, nonsensical text, and other unwanted behaviors that can waste clinicians' time and cause patient harm. Drawing on recent work on direct preference optimization (DPO), we propose a simple method for modifying the behavior of pretrained VLMs performing radiology report generation by suppressing unwanted types of generations. We apply our method to the prevention of hallucinations of prior exams, addressing a long-established problem behavior in models performing chest X-ray report generation. Across our experiments, we find that DPO fine-tuning achieves a 3.2-4.8x reduction in lines hallucinating prior exams while maintaining model performance on clinical accuracy metrics. Our work is, to the best of our knowledge, the first work to apply DPO to medical VLMs, providing a data- and compute- efficient way to suppress problem behaviors while maintaining overall clinical accuracy.
Submitted 14 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Extreme Point Supervised Instance Segmentation
Authors:
Hyeonjun Lee,
Sehyun Hwang,
Suha Kwak
Abstract:
This paper introduces a novel approach to learning instance segmentation using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost points, of each object. These points are readily available in the modern bounding box annotation process while offering strong clues for precise segmentation, and thus allows to improve performance at the same annotation cost with box-supervised methods. Our work considers extreme points as a part of the true instance mask and propagates them to identify potential foreground and background points, which are all together used for training a pseudo label generator. Then pseudo labels given by the generator are in turn used for supervised learning of our final model. On three public benchmarks, our method significantly outperforms existing box-supervised methods, further narrowing the gap with its fully supervised counterpart. In particular, our model generates high-quality masks when a target object is separated into multiple parts, where previous box-supervised methods often fail.
Submitted 3 June, 2024; v1 submitted 31 May, 2024;
originally announced May 2024.
-
Distilling Diffusion Models into Conditional GANs
Authors:
Minguk Kang,
Richard Zhang,
Connelly Barnes,
Sylvain Paris,
Suha Kwak,
Jaesik Park,
Eli Shechtman,
Jun-Yan Zhu,
Taesung Park
Abstract:
We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models -- DMD, SDXL-Turbo, and SDXL-Lightning -- on the zero-shot COCO benchmark.
Submitted 17 July, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
Active Label Correction for Semantic Segmentation with Foundation Models
Authors:
Hoyoung Kim,
Sehyun Hwang,
Suha Kwak,
Jungseul Ok
Abstract:
Training and validating models for semantic segmentation require datasets with pixel-wise annotations, which are notoriously labor-intensive. Although useful priors such as foundation models or crowdsourced datasets are available, they are error-prone. We hence propose an effective framework of active label correction (ALC) based on a design of correction query to rectify pseudo labels of pixels, which in turn is more annotator-friendly than the standard one inquiring to classify a pixel directly according to our theoretical analysis and user study. Specifically, leveraging foundation models providing useful zero-shot predictions on pseudo labels and superpixels, our method comprises two key techniques: (i) an annotator-friendly design of correction query with the pseudo labels, and (ii) an acquisition function looking ahead label expansions based on the superpixels. Experimental results on PASCAL, Cityscapes, and Kvasir-SEG datasets demonstrate the effectiveness of our ALC framework, outperforming prior methods for active semantic segmentation and label correction. Notably, utilizing our method, we obtained a revised dataset of PASCAL by rectifying errors in 2.6 million pixels in PASCAL dataset.
Submitted 4 June, 2024; v1 submitted 16 March, 2024;
originally announced March 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love,
et al. (1112 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 16 December, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Improving Diffusion Models for Authentic Virtual Try-on in the Wild
Authors:
Yisol Choi,
Sangkyung Kwak,
Kyungmin Lee,
Hyungwon Choi,
Jinwoo Shin
Abstract:
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available in our project page: https://idm-vton.github.io
Submitted 29 July, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models
Authors:
Kyungmin Lee,
Sangkyung Kwak,
Kihyuk Sohn,
Jinwoo Shin
Abstract:
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency. However, such fine-tuned models are not robust; they often fail to compose with concepts of pretrained model or other fine-tuned models. To address this, we propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between fine-tuning and pretrained models to retain the pretrained knowledge during fine-tuning. Through extensive experiments on subject and style customization, we demonstrate that our method positions itself on a superior Pareto frontier between subject (or style) consistency and image-text alignment over all previous baselines; it not only outperforms regular fine-tuning objective in image-text alignment, but also shows higher fidelity to the reference images than the method that fine-tunes with additional prior dataset. More importantly, the models fine-tuned with our method can be merged without interference, allowing us to generate custom subjects in a custom style by composing separately customized subject and style models. Notably, we show that our approach achieves better prompt fidelity and subject fidelity than those post-optimized for merging regular fine-tuned models.
Submitted 12 December, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
A Korean Legal Judgment Prediction Dataset for Insurance Disputes
Authors:
Alice Saebom Kwak,
Cheonkam Jeong,
Ji Weon Lim,
Byeongcheol Min
Abstract:
This paper introduces a Korean legal judgment prediction (LJP) dataset for insurance disputes. Successful LJP models on insurance disputes can benefit insurance companies and their customers. It can save both sides' time and money by allowing them to predict how the result would come out if they proceed to the dispute mediation process. As is often the case with low-resource languages, there is a limitation on the amount of data available for this specific task. To mitigate this issue, we investigate how one can achieve a good performance despite the limitation in data. In our experiment, we demonstrate that Sentence Transformer Fine-tuning (SetFit, Tunstall et al., 2022) is a good alternative to standard fine-tuning when training data are limited. The models fine-tuned with the SetFit approach on our data show similar performance to the Korean LJP benchmark models (Hwang et al., 2022) despite the much smaller data size.
Submitted 26 January, 2024;
originally announced January 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee,
et al. (1326 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 9 May, 2025; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Frequency analysis and filter design for directed graphs with polar decomposition
Authors:
Semin Kwak,
Laura Shimabukuro,
Antonio Ortega
Abstract:
In this study, we challenge the traditional approach of frequency analysis on directed graphs, which typically relies on a single measure of signal variation such as total variation. We argue that the inherent directionality in directed graphs necessitates a multifaceted analytical approach that incorporates multiple signal variations definitions. Our methodology leverages the polar decomposition to define two distinct variations, each associated with different matrices derived from this decomposition. This approach provides a novel interpretation in the node domain and reveals aspects of graph signals that may be overlooked with a singular measure of variation. Additionally, we develop graph filters specifically designed to smooth graph signals in accordance with our proposed variations. These filters allow for bypassing costly filtering operations associated with the original graph through effective cascading. We demonstrate the efficacy of our methodology using an M-block cyclic graph example, validating our claims and showcasing the advantages of our multifaceted approach in analyzing signals on directed graphs.
Submitted 15 January, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Activity Grammars for Temporal Action Segmentation
Authors:
Dayoung Gong,
Joonseok Lee,
Deunsol Jung,
Suha Kwak,
Minsu Cho
Abstract:
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.
Submitted 7 December, 2023;
originally announced December 2023.
-
Towards More Practical Group Activity Detection: A New Benchmark and Model
Authors:
Dongkeun Kim,
Youngkil Song,
Minsu Cho,
Suha Kwak
Abstract:
Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both dataset and methodology due to their limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Café. Unlike existing datasets, Café is constructed primarily for GAD and presents more practical scenarios and metrics, as well as being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Café, where it outperformed previous work in terms of both accuracy and inference speed.
Submitted 25 July, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
Coarsely bounded generating sets for mapping class groups of infinite-type surfaces
Authors:
Thomas Hill,
Sanghoon Kwak,
Rebecca Rechkin
Abstract:
Mann and Rafi's seminal work initiated the study of the coarse geometry of big mapping class groups. Specifically, they construct coarsely bounded (CB) generating sets for mapping class groups of a large class of infinite-type surfaces. In this expository note, we illustrate examples of surfaces whose mapping class groups admit such generating sets, as well as those that do not, with the goal of exploring the context of Mann--Rafi's hypotheses.
Submitted 22 May, 2025; v1 submitted 4 December, 2023;
originally announced December 2023.
-
Style-Aware Radiology Report Generation with RadGraph and Few-Shot Prompting
Authors:
Benjamin Yan,
Ruochen Liu,
David E. Kuo,
Subathra Adithan,
Eduardo Pontes Reis,
Stephen Kwak,
Vasantha Kumar Venugopal,
Chloe P. O'Connell,
Agustina Saenz,
Pranav Rajpurkar,
Michael Moor
Abstract:
Automatically generated reports from medical images promise to improve the workflow of radiologists. Existing methods consider an image-to-report modeling task by directly generating a fully-fledged report from an image. However, this conflates the content of the report (e.g., findings and their attributes) with its style (e.g., format and choice of words), which can lead to clinically inaccurate reports. To address this, we propose a two-step approach for radiology report generation. First, we extract the content from an image; then, we verbalize the extracted content into a report that matches the style of a specific radiologist. For this, we leverage RadGraph -- a graph representation of reports -- together with large language models (LLMs). In our quantitative evaluations, we find that our approach leads to beneficial performance. Our human evaluation with clinical raters highlights that the AI-generated reports are indistinguishably tailored to the style of individual radiologists despite leveraging only a few examples as context.
Submitted 31 October, 2023; v1 submitted 26 October, 2023;
originally announced October 2023.
-
A New Spectral Conjugate Subgradient Method with Application in Computed Tomography Image Reconstruction
Authors:
Milagros Loreto,
Thomas Humphries,
Chella Raghavan,
Kenneth Wu,
Sam Kwak
Abstract:
A new spectral conjugate subgradient method is presented to solve nonsmooth unconstrained optimization problems. The method combines the spectral conjugate gradient method for smooth problems with the spectral subgradient method for nonsmooth problems. We study the effect of two different choices of line search, as well as three formulas for determining the conjugate directions. In addition to numerical experiments with standard nonsmooth test problems, we also apply the method to several image reconstruction problems in computed tomography, using total variation regularization. Performance profiles are used to compare the performance of the algorithm using different line search strategies and conjugate directions to that of the original spectral subgradient method. Our results show that the spectral conjugate subgradient algorithm outperforms the original spectral subgradient method, and that the use of the Polak-Ribiere formula for conjugate directions provides the best and most robust performance.
Submitted 5 June, 2024; v1 submitted 26 September, 2023;
originally announced September 2023.
-
Active Learning for Semantic Segmentation with Multi-class Label Query
Authors:
Sehyun Hwang,
Sohyun Lee,
Hoyoung Kim,
Minhyeon Oh,
Jungseul Ok,
Suha Kwak
Abstract:
This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions (e.g., superpixels), and for each such region, asks an oracle for a multi-hot vector indicating all classes existing in the region. This multi-class labeling strategy is substantially more efficient than existing ones like segmentation, polygon, and even dominant class labeling in terms of annotation time per click. However, it introduces the class ambiguity issue in training as it assigns partial labels (i.e., a set of candidate classes) to individual pixels. We thus propose a new algorithm for learning semantic segmentation while disambiguating the partial labels in two stages. In the first stage, it trains a segmentation model directly with the partial labels through two new loss functions motivated by partial label learning and multiple instance learning. In the second stage, it disambiguates the partial labels by generating pixel-wise pseudo labels, which are used for supervised learning of the model. Equipped with a new acquisition function dedicated to the multi-class labeling, our method outperforms previous work on Cityscapes and PASCAL VOC 2012 at a lower annotation cost. Our code and results are available at https://github.com/sehyun03/MulActSeg.
Submitted 6 November, 2023; v1 submitted 17 September, 2023;
originally announced September 2023.
-
Learning Unified Distance Metric Across Diverse Data Distributions with Parameter-Efficient Transfer Learning
Authors:
Sungyeon Kim,
Donghyun Kim,
Suha Kwak
Abstract:
A common practice in metric learning is to train and test an embedding model for each dataset. This dataset-specific approach fails to simulate real-world scenarios that involve multiple heterogeneous distributions of data. In this regard, we explore a new metric learning paradigm, called Unified Metric Learning (UML), which learns a unified distance metric capable of capturing relations across multiple data distributions. UML presents new challenges, such as imbalanced data distribution and bias towards dominant distributions. These issues cause standard metric learning methods to fail in learning a unified metric. To address these challenges, we propose Parameter-efficient Unified Metric leArning (PUMA), which consists of a pre-trained frozen model and two additional modules, a stochastic adapter and a prompt pool. These modules capture dataset-specific knowledge while avoiding bias towards dominant distributions. Additionally, we compile a new unified metric learning benchmark with a total of 8 different datasets. PUMA outperforms the state-of-the-art dataset-specific models while using about 69 times fewer trainable parameters.
Submitted 18 January, 2025; v1 submitted 16 September, 2023;
originally announced September 2023.
-
Generating Sets and Algebraic Properties of Pure Mapping Class Groups of Infinite Graphs
Authors:
George Domat,
Hannah Hoganson,
Sanghoon Kwak
Abstract:
We completely classify the locally finite, infinite graphs with pure mapping class groups admitting a coarsely bounded generating set. We also study algebraic properties of the pure mapping class group: We establish a semidirect product decomposition, compute first integral cohomology, and classify when they satisfy residual finiteness and the Tits alternative. These results provide a framework and some initial steps towards quasi-isometric and algebraic rigidity of these groups.
Submitted 14 September, 2023;
originally announced September 2023.
-
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
Authors:
Dongwon Kim,
Namyup Kim,
Cuiling Lan,
Suha Kwak
Abstract:
Referring image segmentation, the task of segmenting any arbitrary entities described in free-form texts, opens up a variety of vision applications. However, manual labeling of training data for this task is prohibitively costly, leading to a lack of labeled data for training. We address this issue with a weakly supervised learning approach using text descriptions of training images as the only source of supervision. To this end, we first present a new model that discovers semantic entities in the input image and then combines the entities relevant to the text query to predict the mask of the referent. We also present a new loss function that allows the model to be trained without any further supervision. Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models on all the benchmarks.
Submitted 24 October, 2023; v1 submitted 29 August, 2023;
originally announced August 2023.
-
SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems
Authors:
Moon Ye-Bin,
Nam Hyeon-Woo,
Wonseok Choi,
Nayeong Kim,
Suha Kwak,
Tae-Hyun Oh
Abstract:
Data imbalance in training data often leads to biased predictions from trained models, which in turn causes ethical and social issues. A straightforward solution is to carefully curate training data, but given the enormous scale of modern neural networks, this is prohibitively labor-intensive and thus impractical. Inspired by recent developments in generative models, this paper explores the potential of synthetic data to address the data imbalance problem. To be specific, our method, dubbed SYNAuG, leverages synthetic data to equalize the unbalanced distribution of training data. Our experiments demonstrate that, although a domain gap between real and synthetic data exists, training with SYNAuG followed by fine-tuning with a few real samples achieves impressive performance on diverse tasks with different data imbalance issues, surpassing existing task-specific methods for the same purpose.
Submitted 25 April, 2024; v1 submitted 2 August, 2023;
originally announced August 2023.
-
PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization
Authors:
Junhyeong Cho,
Gilhyun Nam,
Sungyeon Kim,
Hunmin Yang,
Suha Kwak
Abstract:
In a joint vision-language space, a text feature (e.g., from "a photo of a dog") could effectively represent its relevant image features (e.g., from dog photos). Also, a recent study has demonstrated the cross-modal transferability phenomenon of this joint space. From these observations, we propose PromptStyler which simulates various distribution shifts in the joint space by synthesizing diverse styles via prompts without using any images to deal with source-free domain generalization. The proposed method learns to generate a variety of style features (from "a S* style of a") via learnable style word vectors for pseudo-words S*. To ensure that learned styles do not distort content information, we force style-content features (from "a S* style of a [class]") to be located near their corresponding content features (from "[class]") in the joint vision-language space. After learning style word vectors, we train a linear classifier using synthesized style-content features. PromptStyler achieves the state of the art on PACS, VLCS, OfficeHome and DomainNet, even though it does not require any images for training.
Submitted 15 August, 2023; v1 submitted 27 July, 2023;
originally announced July 2023.
-
Syzygies of secant varieties of smooth projective curves and gonality sequences
Authors:
Junho Choe,
Sijong Kwak,
Jinhyung Park
Abstract:
The purpose of this paper is to prove that one can read off the gonality sequence of a smooth projective curve from syzygies of secant varieties of the curve embedded by a line bundle of sufficiently large degree. More precisely, together with Ein-Niu-Park's theorem, our main result shows that the gonality sequence of a smooth projective curve completely determines the shape of the minimal free resolutions of secant varieties of the curve of sufficiently large degree. This is a natural generalization of the gonality conjecture on syzygies of smooth projective curves established by Ein-Lazarsfeld and Rathmann to the secant varieties.
Submitted 7 July, 2023;
originally announced July 2023.
-
Rate-Splitting Multiple Access for 6G Networks: Ten Promising Scenarios and Applications
Authors:
Jeonghun Park,
Byungju Lee,
Jinseok Choi,
Hoon Lee,
Namyoon Lee,
Seok-Hwan Park,
Kyoung-Jae Lee,
Junil Choi,
Sung Ho Chae,
Sang-Woon Jeon,
Kyung Sup Kwak,
Bruno Clerckx,
Wonjae Shin
Abstract:
In the upcoming 6G era, multiple access (MA) will play an essential role in achieving the high throughput performance required in a wide range of wireless applications. Since MA and interference management are closely related issues, the conventional MA techniques are limited in that they cannot provide near-optimal performance in universal interference regimes. Recently, rate-splitting multiple access (RSMA) has been gaining much attention. RSMA splits an individual message into two parts: a common part, decodable by every user, and a private part, decodable only by the intended user. Each user first decodes the common message and then decodes its private message by applying successive interference cancellation (SIC). By doing so, RSMA not only embraces the existing MA techniques as special cases but also provides significant performance gains by efficiently mitigating inter-user interference in a broad range of interference regimes. In this article, we first present the theoretical foundation of RSMA. Subsequently, we put forth four key benefits of RSMA: spectral efficiency, robustness, scalability, and flexibility. Building on this, we describe how RSMA can enable ten promising scenarios and applications along with future research directions to pave the way for 6G.
Submitted 22 June, 2023;
originally announced June 2023.
-
Extending CLIP's Image-Text Alignment to Referring Image Segmentation
Authors:
Seoyeon Kim,
Minguk Kang,
Dongwon Kim,
Jaesik Park,
Suha Kwak
Abstract:
Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.
Submitted 7 April, 2024; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Accelerated Bayesian inference of plasma profiles with self-consistent MHD equilibria at W7-X via neural networks
Authors:
Andrea Merlo,
Andrea Pavone,
Daniel Böckenhoff,
Ekkehard Pasch,
Golo Fuchert,
Kai Jakob Brunner,
Kian Rahbarnia,
Jonathan Schilling,
Udo Höfel,
Sehyun Kwak,
Jakob Svensson,
Thomas Sunn Pedersen,
the W7-X team
Abstract:
High-$\langle β\rangle$ operations require a fast and robust inference of plasma parameters with a self-consistent MHD equilibrium. Precalculated MHD equilibria are usually employed at W7-X due to the high computational cost. To address this, we couple a physics-regularized NN model that approximates the ideal-MHD equilibrium with the Bayesian modeling framework Minerva. We show the fast and robust inference of plasma profiles (electron temperature and density) with a self-consistent MHD equilibrium approximated by the NN model. We investigate the robustness of the inference across diverse synthetic W7-X plasma scenarios. The inferred plasma parameters and their uncertainties are compatible with the parameters inferred using VMEC, and the inference time is reduced by more than two orders of magnitude. This work suggests that MHD self-consistent inferences of plasma parameters can be performed between shots.
Submitted 22 May, 2023;
originally announced May 2023.
-
Adaptive Superpixel for Active Learning in Semantic Segmentation
Authors:
Hoyoung Kim,
Minhyeon Oh,
Sehyun Hwang,
Suha Kwak,
Jungseul Ok
Abstract:
Learning semantic segmentation requires pixel-wise annotations, which can be time-consuming and expensive. To reduce the annotation cost, we propose a superpixel-based active learning (AL) framework, which collects a dominant label per superpixel instead. To be specific, it consists of adaptive superpixel and sieving mechanisms, fully dedicated to AL. At each round of AL, we adaptively merge neighboring pixels of similar learned features into superpixels. We then query a selected subset of these superpixels using an acquisition function assuming no uniform superpixel size. This approach is more efficient than existing methods, which rely only on innate features such as RGB color and assume uniform superpixel sizes. Obtaining a dominant label per superpixel drastically reduces annotators' burden as it requires fewer clicks. However, it inevitably introduces noisy annotations due to mismatches between superpixel and ground truth segmentation. To address this issue, we further devise a sieving mechanism that identifies and excludes potentially noisy annotations from learning. Our experiments on both Cityscapes and PASCAL VOC datasets demonstrate the efficacy of adaptive superpixel and sieving mechanisms.
Submitted 20 August, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.