-
Can ChatGPT implement finite element models for geotechnical engineering applications?
Authors:
Taegu Kim,
Tae Sup Yun,
Hyoung Suk Suh
Abstract:
This study assesses the capability of ChatGPT to generate finite element code for geotechnical engineering applications from a set of prompts. We tested three different initial boundary value problems using a hydro-mechanically coupled formulation for unsaturated soils, including the dissipation of excess pore water pressure through fluid mass diffusion in one-dimensional space, time-dependent differential settlement of a strip footing, and gravity-driven seepage. For each case, initial prompting involved providing ChatGPT with necessary information for finite element implementation, such as balance and constitutive equations, problem geometry, initial and boundary conditions, material properties, and spatiotemporal discretization and solution strategies. Any errors and unexpected results were further addressed through prompt augmentation processes until the ChatGPT-generated finite element code passed the verification/validation test. Our results demonstrate that ChatGPT required minimal code revisions when using the FEniCS finite element library, owing to its high-level interfaces that enable efficient programming. In contrast, the MATLAB code generated by ChatGPT necessitated extensive prompt augmentations and/or direct human intervention, as it involves a significant amount of low-level programming required for finite element analysis, such as constructing shape functions or assembling global matrices. Given that prompt engineering for this task requires an understanding of the mathematical formulation and numerical techniques, this study suggests that while a large language model may not yet replace human programmers, it can greatly assist in the implementation of numerical models.
Submitted 4 January, 2025;
originally announced January 2025.
-
Dynamic realization of emergent high-dimensional optical vortices
Authors:
Dongha Kim,
Geonhyeong Park,
Yun-Seok Choi,
Arthur Baucour,
Jisung Hwang,
Sanghyeok Park,
Hee Seong Yun,
Jonghwa Shin,
Haiwen Wang,
Shanhui Fan,
Dong Ki Yoon,
Min-Kyo Seo
Abstract:
The dimensionality of vortical structures has recently been extended beyond two dimensions, providing higher-order topological characteristics and robustness for high-capacity information processing and turbulence control. The generation of high-dimensional vortical structures has mostly been demonstrated in classical systems through the complex interference of fluidic, acoustic, or electromagnetic waves. However, natural materials rarely support three- or higher-dimensional vortical structures and their physical interactions. Here, we present a high-dimensional gradient thickness optical cavity (GTOC) in which the optical coupling of planar metal-dielectric multilayers implements topological interactions across multiple dimensions. Topological interactions in high-dimensional GTOC construct non-trivial topological phases, which induce high-dimensional vortical structures in generalized parameter space in three, four dimensions, and beyond. These emergent high-dimensional vortical structures are observed under electro-optic tomography as optical vortex dynamics in two-dimensional real-space, employing the optical thicknesses of the dielectric layers as synthetic dimensions. We experimentally demonstrate emergent vortical structures, optical vortex lines and vortex rings, in a three-dimensional generalized parameter space and their topological transitions. Furthermore, we explore four-dimensional vortical structures, termed optical vortex sheets, which provide the programmability of real-space optical vortex dynamics. Our findings hold significant promise for emulating high-dimensional physics and developing active topological photonic devices.
Submitted 2 January, 2025;
originally announced January 2025.
-
Controllable Thermo-Stimulated Luminescence in Niobate Persistent Phosphor by Constructing the Photovoltaic/Electrolytic Cell for Remote Intelligent Anti-Counterfeiting
Authors:
Yuanyuan Hu,
Dangli Gao,
Xiangyu Zhang,
Sining Yun
Abstract:
Persistent luminescence (PersL) carrying remote key information plays a crucial role in intelligent anti-counterfeiting applications. However, the weak PersL intensity accompanied by uncontrollability limits its practical application. Here we develop LiNbO3 (LNO):Pr,Bi phosphor with enhanced red PersL by trace doping Sm3+. The LNO:Pr,Bi,Sm phosphor exhibits quadruplet luminescence, including polychrome photoluminescence, PersL, and photo/thermo-stimulated luminescence (PSL/TSL). Particularly, the enhanced TSL can carry remote subjective information independent of the phosphor itself by controlling the temperature. A mechanism of afterglow enhancement is proposed based on constructing reversible photovoltaic cells and electrolytic cells by photothermal redox reactions using Bi3+ + VO and Bi3+/Pr3+ + VLi' ion pair. This study has sparked the exploration of designing the information storage PersL materials for more sophisticated remote intelligent anti-counterfeiting.
Submitted 28 December, 2024;
originally announced December 2024.
-
Optical Coherence Elastography Measures Mechanical Tension in the Lens and Capsule in situ
Authors:
Xu Feng,
Guo-yang Li,
Yuxuan Jiang,
Owen Shortt-Nguyen,
Seok-Hyun Yun
Abstract:
Lens tension is essential for accommodative vision but remains challenging to measure with precision. Here, we present an optical coherence elastography (OCE) technique that quantifies both the tension and elastic modulus of lens tissue and capsule. This method derives mechanical parameters from surface wave dispersion across a critical frequency range of 1-30 kHz. Using isolated lenses from six-month-old pigs, we measured intrinsic anterior capsular tensions of 0-20 kPa and posterior capsular tensions of 40-50 kPa, induced by intra-lenticular pressure at the cortical surface. Young's modulus (E) was 1.9 MPa for anterior capsules and 1.2 MPa for posterior capsules. Tensions in cortical tissue (E ~ 10 kPa) were below 1 kPa. Biaxial zonular stretching (~4% strain) increased anterior capsular tension from near zero to 64 kPa. This acousto-optical method holds significant promise for diagnosing and managing accommodative dysfunctions through lens mechanics assessment in clinical settings.
Submitted 17 December, 2024;
originally announced December 2024.
-
SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning
Authors:
Seokju Yun,
Seunghye Chae,
Dongheon Lee,
Youngmin Ro
Abstract:
Domain generalization (DG) aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Recently, Parameter-Efficient Fine-Tuning (PEFT) of foundation models has shown promising results in the context of the DG problem. Nevertheless, existing PEFT methods still struggle to strike a balance between preserving generalizable components of the pre-trained model and learning task-specific features. To gain insights into the distribution of generalizable components, we begin by analyzing the pre-trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Minor Components Adaptation (SoMA), an approach that selectively tunes minor singular components while keeping the residual parts frozen. SoMA effectively retains the generalization ability of the pre-trained model while efficiently acquiring task-specific skills. Moreover, we freeze domain-generalizable blocks and employ an annealing weight decay strategy, thereby achieving an optimal balance in the delicate trade-off between generalizability and discriminability. SoMA attains state-of-the-art results on multiple benchmarks that span both domain generalized semantic segmentation and domain generalized object detection. In addition, our methods introduce no additional inference overhead or regularization loss, maintain compatibility with any backbone or head, and are designed to be versatile, allowing easy integration into a wide range of tasks.
Submitted 21 March, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
Expanding Event Modality Applications through a Robust CLIP-Based Encoder
Authors:
Sungheon Jeong,
Hanning Chen,
Sanggeon Yun,
Suhyeon Cho,
Wenjun Huang,
Xiangjian Liu,
Mohsen Imani
Abstract:
This paper introduces a powerful encoder that transfers CLIP's capabilities to event-based data, enhancing its utility and expanding its applicability across diverse domains. While large-scale datasets have significantly advanced image-based models, the scarcity of comprehensive event datasets has limited performance potential in event modality. To address this challenge, we adapt CLIP's architecture to align event embeddings with image embeddings, supporting zero-shot learning and preserving text alignment while mitigating catastrophic forgetting. Our encoder achieves strong performance in object recognition, with competitive results in zero-shot and few-shot learning tasks. Notably, it generalizes effectively to events extracted from video data without requiring additional training, highlighting its versatility. Additionally, we integrate this encoder within a cross-modality framework that facilitates interaction across five modalities (Image, Event, Text, Sound, and Depth), expanding the possibilities for cross-modal applications. Overall, this work underscores the transformative potential of a robust event encoder, broadening the scope and utility of event-based data across various fields.
Submitted 8 May, 2025; v1 submitted 4 December, 2024;
originally announced December 2024.
-
Hierarchical Framework for Retrosynthesis Prediction with Enhanced Reaction Center Localization
Authors:
Seongeun Yun,
Won Bo Lee
Abstract:
Retrosynthesis is essential for designing synthetic pathways for complex molecules and can be revolutionized by AI to automate and accelerate chemical synthesis planning for drug discovery and materials science. Here, we propose a hierarchical framework for retrosynthesis prediction that systematically integrates reaction center identification, action prediction, and termination decision into a unified pipeline. Leveraging a molecular encoder pretrained with contrastive learning, the model captures both atom and bond level representations, enabling accurate identification of reaction centers and prediction of chemical actions. The framework addresses the scarcity of multiple reaction center data through augmentation strategies, enhancing the ability of the model to generalize to diverse reaction scenarios. The proposed approach achieves competitive performance across benchmark datasets, with notably high top-k accuracy and exceptional reaction center identification capabilities, demonstrating its robustness in handling complex transformations. These advancements position the framework as a promising tool for future applications in material design and drug discovery.
Submitted 29 November, 2024;
originally announced November 2024.
-
Pretrained LLM Adapted with LoRA as a Decision Transformer for Offline RL in Quantitative Trading
Authors:
Suyeol Yun
Abstract:
Developing effective quantitative trading strategies using reinforcement learning (RL) is challenging due to the high risks associated with online interaction with live financial markets. Consequently, offline RL, which leverages historical market data without additional exploration, becomes essential. However, existing offline RL methods often struggle to capture the complex temporal dependencies inherent in financial time series and may overfit to historical patterns. To address these challenges, we introduce a Decision Transformer (DT) initialized with pre-trained GPT-2 weights and fine-tuned using Low-Rank Adaptation (LoRA). This architecture leverages the generalization capabilities of pre-trained language models and the efficiency of LoRA to learn effective trading policies from expert trajectories solely from historical data. Our model performs competitively with established offline RL algorithms, including Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Behavior Cloning (BC), as well as a baseline Decision Transformer with randomly initialized GPT-2 weights and LoRA. Empirical results demonstrate that our approach effectively learns from expert trajectories and secures superior rewards in certain trading scenarios, highlighting the effectiveness of integrating pre-trained language models and parameter-efficient fine-tuning in offline RL for quantitative trading. Replication code for our experiments is publicly available at https://github.com/syyunn/finrl-dt
Submitted 26 November, 2024;
originally announced November 2024.
-
Finding "Good Views" of Electrocardiogram Signals for Inferring Abnormalities in Cardiac Condition
Authors:
Hyewon Jeong,
Suyeol Yun,
Hammaad Adam
Abstract:
Electrocardiograms (ECGs) are an established technique to screen for abnormal cardiac signals. Recent work has established that it is possible to detect arrhythmia directly from the ECG signal using deep learning algorithms. While a few prior approaches with contrastive learning have been successful, the best way to define a positive sample remains an open question. In this project, we investigate several ways to define positive samples, and assess which approach yields the best performance in a downstream task of classifying arrhythmia. We explore spatiotemporal invariances, generic augmentations, demographic similarities, cardiac rhythms, and wave attributes of ECG as potential ways to match positive samples. We then evaluate each strategy with downstream task performance, and find that learned representations invariant to patient identity are powerful in arrhythmia detection. Our code is available at: https://github.com/mandiehyewon/goodviews_ecg.git
Submitted 11 November, 2024;
originally announced November 2024.
-
Noise-Aware Ensemble Learning for Efficient Radar Modulation Recognition
Authors:
Do-Hyun Park,
Min-Wook Jeon,
Jinwoo Jeong,
Isaac Sim,
Sangbom Yun,
Junghyun Seo,
Hyoung-Nam Kim
Abstract:
Electronic warfare support (ES) systems intercept adversary radar signals and estimate various types of signal information, including modulation schemes. The accurate and rapid identification of modulation schemes under conditions of very low signal power remains a significant challenge for ES systems. This paper proposes a recognition model based on a noise-aware ensemble learning (NAEL) framework to efficiently recognize radar modulation schemes in noisy environments. The NAEL framework evaluates the influence of noise on recognition and adaptively selects an appropriate neural network structure, offering significant advantages in terms of computational efficiency and recognition performance. We present the analysis results of the recognition performance of the proposed model based on experimental data. Our recognition model demonstrates superior recognition accuracy with low computational complexity compared to conventional classification models.
Submitted 14 May, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
Exploiting Boosting in Hyperdimensional Computing for Enhanced Reliability in Healthcare
Authors:
SungHeon Jeong,
Hamza Errahmouni Barkam,
Sanggeon Yun,
Yeseong Kim,
Shaahin Angizi,
Mohsen Imani
Abstract:
Hyperdimensional computing (HDC) enables efficient data encoding and processing in high-dimensional space, benefiting machine learning and data analysis. However, underutilization of these spaces can lead to overfitting and reduced model reliability, especially in data-limited systems, a critical issue in sectors like healthcare that demand robustness and consistent performance. We introduce BoostHD, an approach that applies boosting algorithms to partition the hyperdimensional space into subspaces, creating an ensemble of weak learners. By integrating boosting with HDC, BoostHD enhances performance and reliability beyond existing HDC methods. Our analysis highlights the importance of efficient utilization of hyperdimensional spaces for improved model performance. Experiments on healthcare datasets show that BoostHD outperforms state-of-the-art methods. On the WESAD dataset, it achieved an accuracy of 98.37%, surpassing Random Forest, XGBoost, and OnlineHD. BoostHD also demonstrated superior inference efficiency and stability, maintaining high accuracy under data imbalance and noise. In person-specific evaluations, it achieved an average accuracy of 96.19%, outperforming other models. By addressing the limitations of both boosting and HDC, BoostHD expands the applicability of HDC in critical domains where reliability and precision are paramount.
Submitted 13 January, 2025; v1 submitted 21 November, 2024;
originally announced November 2024.
-
Consensus Statement on Brillouin Light Scattering Microscopy of Biological Materials
Authors:
Pierre Bouvet,
Carlo Bevilacqua,
Yogeshwari Ambekar,
Giuseppe Antonacci,
Joshua Au,
Silvia Caponi,
Sophie Chagnon-Lessard,
Juergen Czarske,
Thomas Dehoux,
Daniele Fioretto,
Yujian Fu,
Jochen Guck,
Thorsten Hamann,
Dag Heinemann,
Torsten Jähnke,
Hubert Jean-Ruel,
Irina Kabakova,
Kristie Koski,
Nektarios Koukourakis,
David Krause,
Salvatore La Cavera III,
Timm Landes,
Jinhao Li,
Jeremie Margueritat,
Maurizio Mattarelli
, et al. (19 additional authors not shown)
Abstract:
Brillouin Light Scattering (BLS) spectroscopy is a non-invasive, non-contact, label-free optical technique that can provide information on the mechanical properties of a material on the sub-micron scale. Over the last decade it has seen increased applications in the life sciences, driven by the observed significance of mechanical properties in biological processes, the realization of more sensitive BLS spectrometers and its extension to an imaging modality. As with other spectroscopic techniques, BLS measurements not only detect signals characteristic of the investigated sample, but also of the experimental apparatus, and can be significantly affected by measurement conditions. The aim of this consensus statement is to improve the comparability of BLS studies by providing reporting recommendations for the measured parameters and detailing common artifacts. Given that most BLS studies of biological matter are still at proof-of-concept stages and use different, often self-built, spectrometers, a consensus statement is particularly timely to assure unified advancement.
Submitted 18 November, 2024;
originally announced November 2024.
-
Scalable Readability Evaluation for Graph Layouts: 2D Geometric Distributed Algorithms
Authors:
Sanggeon Yun
Abstract:
Graphs, consisting of vertices and edges, are vital for representing complex relationships in fields like social networks, finance, and blockchain. Visualizing these graphs helps analysts identify structural patterns, with readability metrics (such as node occlusion and edge crossing) assessing layout clarity. However, calculating these metrics is computationally intensive, making scalability a challenge for large graphs. Without efficient readability metrics, layout generation processes, despite numerous studies focused on accelerating them, face a bottleneck, making it challenging to select or produce optimized layouts swiftly. Previous approaches attempted to accelerate this process through machine learning models. Machine learning approaches aimed to predict readability scores from rendered images of graphs. While these models offered some improvement, they struggled with scalability and accuracy, especially for graphs with thousands of nodes. For instance, this approach requires substantial memory to process large images, as it relies on rendered images of the graph; graphs with more than 600 nodes cannot be inputted into the model, and errors can exceed 55% in some readability metrics due to difficulties in generalizing across diverse graph layouts. This study addresses these limitations by introducing scalable algorithms for readability evaluation in distributed environments, utilizing Spark's DataFrame and GraphFrame frameworks to efficiently manage large data volumes across multiple machines. Experimental results show that these distributed algorithms significantly reduce computation time, achieving up to a 17x speedup for node occlusion and a 146x improvement for edge crossing on large datasets. These enhancements make scalable graph readability evaluation practical and efficient, overcoming the limitations of previous machine-learning approaches.
Submitted 14 November, 2024;
originally announced November 2024.
-
Continuous GNN-based Anomaly Detection on Edge using Efficient Adaptive Knowledge Graph Learning
Authors:
Sanggeon Yun,
Ryozo Masukawa,
William Youngwoo Chung,
Minhyoung Na,
Nathaniel Bastian,
Mohsen Imani
Abstract:
The increasing demand for robust security solutions across various industries has made Video Anomaly Detection (VAD) a critical task in applications such as intelligent surveillance, evidence investigation, and violence detection. Traditional approaches to VAD often rely on finetuning large pre-trained models, which can be computationally expensive and impractical for real-time or resource-constrained environments. To address this, MissionGNN introduced a more efficient method by training a graph neural network (GNN) using a fixed knowledge graph (KG) derived from large language models (LLMs) like GPT-4. While this approach demonstrated significant efficiency in computational power and memory, it faces limitations in dynamic environments where frequent updates to the KG are necessary due to evolving behavior trends and shifting data patterns. These updates typically require cloud-based computation, posing challenges for edge computing applications. In this paper, we propose a novel framework that facilitates continuous KG adaptation directly on edge devices, overcoming the limitations of cloud dependency. Our method dynamically modifies the KG through a three-phase process: pruning, alternating, and creating nodes, enabling real-time adaptation to changing data trends. This continuous learning approach enhances the robustness of anomaly detection models, making them more suitable for deployment in dynamic and resource-constrained environments.
Submitted 13 January, 2025; v1 submitted 13 November, 2024;
originally announced November 2024.
-
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs
Authors:
Haneul Yoo,
Cheonbok Park,
Sangdoo Yun,
Alice Oh,
Hwaran Lee
Abstract:
Large language models (LLMs) now exhibit near human-level performance in various tasks, but their performance drops drastically after a handful of high-resource languages due to the imbalance in pre-training data. Inspired by the human process of second language acquisition, particularly code-switching (the practice of language alternation in a conversation), we propose code-switching curriculum learning (CSCL) to enhance cross-lingual transfer for LLMs. CSCL mimics the stages of human language learning by progressively training models with a curriculum consisting of 1) token-level code-switching, 2) sentence-level code-switching, and 3) monolingual corpora. Using Qwen 2 as our underlying model, we demonstrate the efficacy of the CSCL in improving language transfer to Korean, achieving significant performance gains compared to monolingual continual pre-training methods. Ablation studies reveal that both token- and sentence-level code-switching significantly enhance cross-lingual transfer and that curriculum learning amplifies these effects. We also extend our findings into various languages, including Japanese (high-resource) and Indonesian (low-resource), and using two additional models (Gemma 2 and Phi 3.5). We further show that CSCL mitigates spurious correlations between language resources and safety alignment, presenting a robust, efficient framework for more equitable language transfer in LLMs. We observe that CSCL is effective for low-resource settings where high-quality, monolingual corpora for language transfer are hardly available.
Submitted 4 November, 2024;
originally announced November 2024.
-
Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models
Authors:
Wonguk Cho,
Seokeon Choi,
Debasmit Das,
Matthias Reisser,
Taesup Kim,
Sungrack Yun,
Fatih Porikli
Abstract:
Recent advancements in text-to-image diffusion models have enabled the personalization of these models to generate custom images from textual prompts. This paper presents an efficient LoRA-based personalization approach for on-device subject-driven generation, where pre-trained diffusion models are fine-tuned with user-specific data on resource-constrained devices. Our method, termed Hollowed Net, enhances memory efficiency during fine-tuning by modifying the architecture of a diffusion U-Net to temporarily remove a fraction of its deep layers, creating a hollowed structure. This approach directly addresses on-device memory constraints and substantially reduces GPU memory requirements for training, in contrast to previous methods that primarily focus on minimizing training steps and reducing the number of parameters to update. Additionally, the personalized Hollowed Net can be transferred back into the original U-Net, enabling inference without additional memory overhead. Quantitative and qualitative analyses demonstrate that our approach not only reduces training memory to levels as low as those required for inference but also maintains or improves personalization performance compared to existing methods.
Submitted 2 November, 2024;
originally announced November 2024.
-
Conditional Synthesis of 3D Molecules with Time Correction Sampler
Authors:
Hojung Jung,
Youngrok Park,
Laura Schmid,
Jaehyeong Jo,
Dongkyu Lee,
Bongsang Kim,
Se-Young Yun,
Jinwoo Shin
Abstract:
Diffusion models have demonstrated remarkable success in various domains, including molecular generation. However, conditional molecular generation remains a fundamental challenge due to an intrinsic trade-off between targeting specific chemical properties and generating meaningful samples from the data distribution. In this work, we present Time-Aware Conditional Synthesis (TACS), a novel approach to conditional generation on diffusion models. It integrates adaptively controlled plug-and-play "online" guidance into a diffusion model, driving samples toward the desired properties while maintaining validity and stability. A key component of our algorithm is our new type of diffusion sampler, Time Correction Sampler (TCS), which is used to control guidance and ensure that the generated molecules remain on the correct manifold at each reverse step of the diffusion process at the same time. Our proposed method demonstrates significant performance in conditional 3D molecular generation and offers a promising approach towards inverse molecular design, potentially facilitating advancements in drug discovery, materials science, and other related fields.
Submitted 1 November, 2024;
originally announced November 2024.
-
Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models
Authors:
Haritz Puerto,
Martin Gubri,
Sangdoo Yun,
Seong Joon Oh
Abstract:
Membership inference attacks (MIA) attempt to verify the membership of a given data sample in the training set for a model. MIA has become relevant in recent years, following the rapid development of large language models (LLM). Many are concerned about the usage of copyrighted materials for training them and call for methods for detecting such usage. However, recent research has largely concluded that current MIA methods do not work on LLMs. Even when they seem to work, it is usually because of the ill-designed experimental setup where other shortcut features enable "cheating." In this work, we argue that MIA still works on LLMs, but only when multiple documents are presented for testing. We construct new benchmarks that measure the MIA performances at a continuous scale of data samples, from sentences (n-grams) to a collection of documents (multiple chunks of tokens). To validate the efficacy of current MIA approaches at greater scales, we adapt a recent work on Dataset Inference (DI) for the task of binary membership detection that aggregates paragraph-level MIA features to enable MIA at document and collection of documents level. This baseline achieves the first successful MIA on pre-trained and fine-tuned LLMs.
Submitted 3 February, 2025; v1 submitted 31 October, 2024;
originally announced November 2024.
-
PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation
Authors:
Ryozo Masukawa,
Sanggeon Yun,
Yoshiki Yamaguchi,
Mohsen Imani
Abstract:
Video crime detection is a significant application of computer vision and artificial intelligence. However, existing datasets primarily focus on detecting severe crimes by analyzing entire video clips, often neglecting the precursor activities (i.e., privacy violations) that could potentially prevent these crimes. To address this limitation, we present PV-VTT (Privacy Violation Video To Text), a unique multimodal dataset aimed at identifying privacy violations. PV-VTT provides detailed annotations for both video and text in scenarios. To ensure the privacy of individuals in the videos, we only provide video feature vectors, avoiding the release of any raw video data. This privacy-focused approach allows researchers to use the dataset while protecting participant confidentiality. Recognizing that privacy violations are often ambiguous and context-dependent, we propose a Graph Neural Network (GNN)-based video description model. Our model generates a GNN-based prompt with an image for a Large Language Model (LLM), which delivers cost-effective and high-quality video descriptions. By leveraging a single video frame along with relevant text, our method reduces the number of input tokens required, maintaining descriptive quality while optimizing LLM API usage. Extensive experiments validate the effectiveness and interpretability of our approach in video description tasks and the flexibility of our PV-VTT dataset.
Submitted 4 December, 2024; v1 submitted 29 October, 2024;
originally announced October 2024.
-
Probabilistic Language-Image Pre-Training
Authors:
Sanghyuk Chun,
Wonjae Kim,
Song Park,
Sangdoo Yun
Abstract:
Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an "uncertainty token" without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip
Submitted 12 March, 2025; v1 submitted 24 October, 2024;
originally announced October 2024.
-
$C^2$: Scalable Auto-Feedback for LLM-based Chart Generation
Authors:
Woosung Koh,
Jang Han Yoon,
MinHyung Lee,
Youngjin Song,
Jaegwan Cho,
Jaehyun Kang,
Taehyeon Kim,
Se-Young Yun,
Youngjae Yu,
Bongshin Lee
Abstract:
Generating high-quality charts with Large Language Models (LLMs) presents significant challenges due to limited data and the high cost of scaling through human curation. $\langle \text{instruction}, \text{data}, \text{code} \rangle$ triplets are scarce and expensive to manually curate as their creation demands technical expertise. To address this scalability challenge, we introduce a reference-free automatic feedback generator, which eliminates the need for costly human intervention. Our novel framework, C$^2$, consists of (1) an automatic feedback provider (ChartAF) and (2) a diverse, reference-free dataset (ChartUIE-8K). The results are compelling: in our first experiment, 74% of respondents strongly preferred, and 10% preferred, the results after feedback. The second post-feedback experiment demonstrates that ChartAF outperforms nine baselines. Moreover, ChartUIE-8K significantly improves data diversity by increasing queries, datasets, and chart types by 5982%, 1936%, and 91%, respectively, over benchmarks. Finally, a study of LLM users revealed that 94% of participants preferred ChartUIE-8K's queries, with 93% deeming them aligned with real-world use cases. Core contributions are available as open-source at chartsquared.github.io, with ample qualitative examples.
Submitted 12 February, 2025; v1 submitted 24 October, 2024;
originally announced October 2024.
-
FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL
Authors:
Woosung Koh,
Wonbeen Oh,
Siyeol Kim,
Suhin Shin,
Hyeongjin Kim,
Jaein Jang,
Junghyun Lee,
Se-Young Yun
Abstract:
Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory -- a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. FlickerFusion stochastically drops out parts of the observation space, emulating being in-domain when inferenced OOD. The results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-à-vis the backbone, compared to existing methods. Benchmarks, implementations, and model weights are organized and open-sourced at flickerfusion305.github.io, accompanied by ample demo video renderings.
Submitted 3 December, 2024; v1 submitted 21 October, 2024;
originally announced October 2024.
-
EP-SAM: Weakly Supervised Histopathology Segmentation via Enhanced Prompt with Segment Anything
Authors:
Joonhyeon Song,
Seohwan Yun,
Seongho Yoon,
Joohyeok Kim,
Sangmin Lee
Abstract:
This work proposes a novel approach beyond supervised learning for effective pathological image analysis, addressing the challenge of limited robust labeled data. Pathological diagnosis of diseases like cancer has conventionally relied on the evaluation of morphological features by physicians and pathologists. However, recent advancements in computer-aided diagnosis (CAD) systems are gaining significant attention as diagnostic support tools. Although the advancement of deep learning has improved CAD significantly, segmentation models typically require large pixel-level annotated datasets, and such labeling is expensive. Existing studies not based on supervised approaches still struggle with limited generalization, and no practical approach has emerged yet. To address this issue, we present a weakly supervised semantic segmentation (WSSS) model by combining class activation map and Segment Anything Model (SAM)-based pseudo-labeling. For effective pretraining, we adopt SAM, a foundation model that is pretrained on large datasets and operates in zero-shot configurations using only coarse prompts. The proposed approach transfers the enhanced Attention Dropout Layer's knowledge to SAM, thereby generating pseudo-labels. To demonstrate the superiority of the proposed method, experimental studies are conducted on histopathological breast cancer datasets. The proposed method outperformed other WSSS methods across three datasets, demonstrating its efficiency by achieving this with only 12GB of GPU memory during training. Our code is available at https://github.com/QI-NemoSong/EP-SAM
Submitted 21 October, 2024; v1 submitted 17 October, 2024;
originally announced October 2024.
-
PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches
Authors:
Rana Muhammad Shahroz Khan,
Pingzhi Li,
Sukwon Yun,
Zhenyu Wang,
Shahriar Nirjon,
Chau-Wai Wong,
Tianlong Chen
Abstract:
As large language models (LLMs) increasingly shape the AI landscape, fine-tuning pretrained models has become more popular than in the pre-LLM era for achieving optimal performance in domain-specific tasks. However, pretrained LLMs such as ChatGPT are periodically evolved (i.e., model parameters are frequently updated), making it challenging for downstream users with limited resources to keep up with fine-tuning the newest LLMs for their domain application. Even though fine-tuning costs have nowadays been reduced thanks to the innovations of parameter-efficient fine-tuning such as LoRA, not all downstream users have adequate computing for frequent personalization. Moreover, access to fine-tuning datasets, particularly in sensitive domains such as healthcare, could be time-restrictive, making it crucial to retain the knowledge encoded in earlier fine-tuned rounds for future adaptation. In this paper, we present PortLLM, a training-free framework that (i) creates an initial lightweight model update patch to capture domain-specific knowledge, and (ii) allows a subsequent seamless plugging for the continual personalization of evolved LLM at minimal cost. Our extensive experiments cover seven representative datasets, from easier question-answering tasks {BoolQ, SST2} to harder reasoning tasks {WinoGrande, GSM8K}, and models including {Mistral-7B, Llama2, Llama3.1, and Gemma2}, validating the portability of our designed model patches and showcasing the effectiveness of our proposed framework. For instance, PortLLM achieves comparable performance to LoRA fine-tuning with reductions of up to 12.2x in GPU memory usage. Finally, we provide theoretical justifications to understand the portability of our model update patches, which offers new insights into the theoretical dimension of LLMs' personalization.
Submitted 28 March, 2025; v1 submitted 8 October, 2024;
originally announced October 2024.
-
Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models
Authors:
Yongjin Yang,
Sihyeon Kim,
Hojung Jung,
Sangmin Bae,
SangMook Kim,
Se-Young Yun,
Kimin Lee
Abstract:
Fine-tuning text-to-image diffusion models with human feedback is an effective method for aligning model behavior with human intentions. However, this alignment process often suffers from slow convergence due to the large size and noise present in human feedback datasets. In this work, we propose FiFA, a novel automated data filtering algorithm designed to enhance the fine-tuning of diffusion models using human feedback datasets with direct preference optimization (DPO). Specifically, our approach selects data by solving an optimization problem to maximize three components: preference margin, text quality, and text diversity. The concept of preference margin is used to identify samples that are highly informative in addressing the noisy nature of feedback datasets, and it is calculated using a proxy reward model. Additionally, we incorporate text quality, assessed by large language models to prevent harmful contents, and consider text diversity through a k-nearest neighbor entropy estimator to improve generalization. Finally, we integrate all these components into an optimization process, approximating the solution by assigning an importance score to each data pair and selecting the most important ones. As a result, our method efficiently filters data automatically, without the need for manual intervention, and can be applied to any large-scale dataset. Experimental results show that FiFA significantly enhances training stability and achieves better performance, being preferred by humans 17% more, while using less than 0.5% of the full data and thus 1% of the GPU hours compared to utilizing full human feedback datasets.
Submitted 2 April, 2025; v1 submitted 14 October, 2024;
originally announced October 2024.
-
Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
Authors:
Sukwon Yun,
Inyoung Choi,
Jie Peng,
Yangfan Wu,
Jingxuan Bao,
Qiyiwen Zhang,
Jiayi Xin,
Qi Long,
Tianlong Chen
Abstract:
Multimodal learning has gained increasing importance across various fields, offering the ability to integrate data from diverse sources such as images, text, and personalized records, which are frequently observed in medical domains. However, in scenarios where some modalities are missing, many existing frameworks struggle to accommodate arbitrary modality combinations, often relying heavily on a single modality or complete data. This oversight of potential modality combinations limits their applicability in real-world situations. To address this challenge, we propose Flex-MoE (Flexible Mixture-of-Experts), a new framework designed to flexibly incorporate arbitrary modality combinations while maintaining robustness to missing data. The core idea of Flex-MoE is to first address missing modalities using a new missing modality bank that integrates observed modality combinations with the corresponding missing ones. This is followed by a uniquely designed Sparse MoE framework. Specifically, Flex-MoE first trains experts using samples with all modalities to inject generalized knowledge through the generalized router ($\mathcal{G}$-Router). The $\mathcal{S}$-Router then specializes in handling fewer modality combinations by assigning the top-1 gate to the expert corresponding to the observed modality combination. We evaluate Flex-MoE on the ADNI dataset, which encompasses four modalities in the Alzheimer's Disease domain, as well as on the MIMIC-IV dataset. The results demonstrate the effectiveness of Flex-MoE highlighting its ability to model arbitrary modality combinations in diverse missing modality scenarios. Code is available at https://github.com/UNITES-Lab/flex-moe.
Submitted 31 October, 2024; v1 submitted 10 October, 2024;
originally announced October 2024.
-
A Unified Framework for Motion Reasoning and Generation in Human Interaction
Authors:
Jeongeun Park,
Sungjoon Choi,
Sangdoo Yun
Abstract:
Recent advancements in large language models (LLMs) have significantly improved their ability to generate natural and contextually relevant text, enabling more human-like AI interactions. However, generating and understanding interactive human-like motion, where multiple individuals engage in coordinated movements, remains challenging due to the complexity of modeling these interactions. Additionally, a unified and versatile model is needed to handle diverse interactive scenarios, such as chat systems that dynamically adapt to user instructions and assigned roles. To address these challenges, we introduce VIM, the Versatile Interactive Motion-language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. Unlike previous studies that primarily focus on uni-directional tasks such as text-to-motion or motion-to-text, VIM employs a unified architecture capable of simultaneously understanding and generating both motion and text modalities. Given the absence of an appropriate dataset to support this task, we introduce Inter-MT2, a large-scale instruction-tuning dataset containing 82.7K multi-turn interactive motion instructions, covering 153K interactive motion samples. Inter-MT2 spans diverse instructional scenarios, including motion editing, question answering, and story generation, leveraging off-the-shelf large language models and motion diffusion models to construct a broad set of interactive motion instructions. We extensively evaluate the versatility of VIM across multiple interactive motion-related tasks, including motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences.
Submitted 12 March, 2025; v1 submitted 7 October, 2024;
originally announced October 2024.
-
DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation
Authors:
Changdae Oh,
Yixuan Li,
Kyungwoo Song,
Sangdoo Yun,
Dongyoon Han
Abstract:
Adapting a pre-trained foundation model on downstream tasks should ensure robustness against distribution shifts without the need to retrain the whole model. Although existing weight interpolation methods are simple yet effective, we argue their static nature limits downstream performance while achieving efficiency. In this work, we propose DaWin, a training-free dynamic weight interpolation method that leverages the entropy of individual models over each unlabeled test sample to assess model expertise, and computes per-sample interpolation coefficients dynamically. Unlike previous works that typically rely on additional training to learn such coefficients, our approach requires no training. Then, we propose a mixture modeling approach that greatly reduces inference overhead raised by dynamic interpolation. We validate DaWin on the large-scale visual recognition benchmarks, spanning 14 tasks across robust fine-tuning -- ImageNet and five derived distribution shift benchmarks -- and multi-task learning with eight classification tasks. Results demonstrate that DaWin achieves significant performance gain in considered settings, with minimal computational overhead. We further discuss DaWin's analytic behavior to explain its empirical success.
Submitted 13 March, 2025; v1 submitted 3 October, 2024;
originally announced October 2024.
-
Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems
Authors:
Guibin Zhang,
Yanwei Yue,
Zhixun Li,
Sukwon Yun,
Guancheng Wan,
Kun Wang,
Dawei Cheng,
Jeffrey Xu Yu,
Tianlong Chen
Abstract:
Recent advancements in large language model (LLM)-powered agents have shown that collective intelligence can significantly outperform individual capabilities, largely attributed to the meticulously designed inter-agent communication topologies. Though impressive in performance, existing multi-agent pipelines inherently introduce substantial token overhead, as well as increased economic costs, which pose challenges for their large-scale deployments. In response to this challenge, we propose an economical, simple, and robust multi-agent communication framework, termed $\texttt{AgentPrune}$, which can seamlessly integrate into mainstream multi-agent systems and prunes redundant or even malicious communication messages. Technically, $\texttt{AgentPrune}$ is the first to identify and formally define the \textit{communication redundancy} issue present in current LLM-based multi-agent pipelines, and efficiently performs one-shot pruning on the spatial-temporal message-passing graph, yielding a token-economic and high-performing communication topology. Extensive experiments across six benchmarks demonstrate that $\texttt{AgentPrune}$ \textbf{(I)} achieves results comparable to state-of-the-art topologies at merely $\$5.6$ cost compared to their $\$43.7$, \textbf{(II)} integrates seamlessly into existing multi-agent frameworks with $28.1\%\sim72.8\%\downarrow$ token reduction, and \textbf{(III)} successfully defends against two types of agent-based adversarial attacks with $3.5\%\sim10.8\%\uparrow$ performance boost.
Submitted 3 October, 2024;
originally announced October 2024.
-
CAD: Memory Efficient Convolutional Adapter for Segment Anything
Authors:
Joohyeok Kim,
Joonhyeon Song,
Seohwan Yun,
Seongho Yoon,
Sangmin Lee
Abstract:
The foundation model for image segmentation, Segment Anything (SAM), has been actively researched in various fields since its proposal. Various approaches have been proposed to adapt SAM to specific domains, with one notable approach involving the addition and training of lightweight adapter modules. While adapter-based fine-tuning approaches have reported parameter efficiency and significant performance improvements, they face an often overlooked issue: the excessive consumption of GPU memory relative to the number of trainable parameters. Addressing this issue, this paper proposes a memory-efficient parallel convolutional adapter architecture. This architecture connects in parallel with SAM's image encoder, eliminating the need to store activations and gradients of the image encoder during model training. Our proposed architecture demonstrated competitive experimental results while using less than half the GPU memory compared to SAM Adapter, indicating its value as an alternative to simple decoder fine-tuning when hardware limitations preclude adapter-based learning. Our code implementation is available at our GitHub.
Submitted 24 September, 2024;
originally announced September 2024.
-
Safe Control of Quadruped in Varying Dynamics via Safety Index Adaptation
Authors:
Kai S. Yun,
Rui Chen,
Chase Dunaway,
John M. Dolan,
Changliu Liu
Abstract:
Varying dynamics pose a fundamental difficulty when deploying safe control laws in the real world. Safety Index Synthesis (SIS) deeply relies on the system dynamics and once the dynamics change, the previously synthesized safety index becomes invalid. In this work, we show the real-time efficacy of Safety Index Adaptation (SIA) in varying dynamics. SIA enables real-time adaptation to the changing dynamics so that the adapted safe control law can still guarantee 1) forward invariance within a safe region and 2) finite time convergence to that safe region. This work employs SIA on a package-carrying quadruped robot, where the payload weight changes in real-time. SIA updates the safety index when the dynamics change, e.g., a change in payload weight, so that the quadruped can avoid obstacles while achieving its performance objectives. Numerical study provides theoretical guarantees for SIA and a series of hardware experiments demonstrate the effectiveness of SIA in real-world deployment in avoiding obstacles under varying dynamics.
Submitted 15 September, 2024;
originally announced September 2024.
-
FedHide: Federated Learning by Hiding in the Neighbors
Authors:
Hyunsin Park,
Sungrack Yun
Abstract:
We propose a prototype-based federated learning method designed for embedding networks in classification or verification tasks. Our focus is on scenarios where each client has data from a single class. The main challenge is to develop an embedding network that can distinguish between different classes while adhering to privacy constraints. Sharing true class prototypes with the server or other clients could potentially compromise sensitive information. To tackle this issue, we propose a proxy class prototype that will be shared among clients instead of the true class prototype. Our approach generates proxy class prototypes by linearly combining them with their nearest neighbors. This technique conceals the true class prototype while enabling clients to learn discriminative embedding networks. We compare our method to alternative techniques, such as adding random Gaussian noise and using random selection with cosine similarity constraints. Furthermore, we evaluate the robustness of our approach against gradient inversion attacks and introduce a measure for prototype leakage. This measure quantifies the extent of private information revealed when sharing the proposed proxy class prototype. Moreover, we provide a theoretical analysis of the convergence properties of our approach. Our proposed method for federated learning from scratch demonstrates its effectiveness through empirical results on three benchmark datasets: CIFAR-100, VoxCeleb1, and VGGFace2.
Submitted 12 September, 2024;
originally announced September 2024.
-
Stable Language Model Pre-training by Reducing Embedding Variability
Authors:
Woojin Chung,
Jiwoo Hong,
Na Min An,
James Thorne,
Se-Young Yun
Abstract:
Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability by calculating gradient variance at every step is impractical due to the significant computational costs. We explore Token Embedding Variability (TEV) as a simple and efficient proxy for assessing pre-training stability in language models with pre-layer normalization, given that shallower layers are more prone to gradient explosion (section 2.2). Moreover, we propose Multi-head Low-Rank Attention (MLRA) as an architecture to alleviate such instability by limiting the exponential growth of output embedding variance, thereby preventing the gradient explosion (section 3.2). Empirical results on GPT-2 with MLRA demonstrate increased stability and lower perplexity, particularly in deeper models.
Submitted 12 September, 2024;
originally announced September 2024.
-
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
Authors:
Sungmin Yun,
Kwanhee Kyung,
Juhwan Cho,
Jaewan Choi,
Jongmin Kim,
Byeongho Kim,
Sukhan Lee,
Kyomin Sohn,
Jung Ho Ahn
Abstract:
Large language models (LLMs) have emerged due to their capability to generate high-quality content across diverse contexts. To reduce their explosively increasing demands for computing resources, a mixture of experts (MoE) has emerged. The MoE layer enables exploiting a huge number of parameters with less computation. Applying state-of-the-art continuous batching increases throughput; however, it leads to frequent DRAM access in the MoE and attention layers. We observe that conventional computing devices have limitations when processing the MoE and attention layers, which dominate the total execution time and exhibit low arithmetic intensity (Op/B). Processing MoE layers only with devices targeting low-Op/B such as processing-in-memory (PIM) architectures is challenging due to the fluctuating Op/B in the MoE layer caused by continuous batching.
To address these challenges, we propose Duplex, which comprises an xPU tailored for high-Op/B operations and Logic-PIM to effectively perform low-Op/B operations within a single device. Duplex selects the most suitable processor based on the Op/B of each layer within LLMs. As the Op/B of the MoE layer is at least 1 and that of the attention layer has a value of 4-8 for grouped query attention, prior PIM architectures, which place processing units inside DRAM dies and only target extremely low-Op/B (under one) operations, are not efficient. Based on recent trends, Logic-PIM adds more through-silicon vias (TSVs) to enable high-bandwidth communication between the DRAM die and the logic die and places powerful processing units on the logic die, which is best suited for handling low-Op/B operations ranging from a few to a few dozen. To maximally utilize the xPU and Logic-PIM, we propose expert and attention co-processing.
Submitted 2 September, 2024;
originally announced September 2024.
-
Large-Scale Targeted Cause Discovery with Data-Driven Learning
Authors:
Jang-Hyun Kim,
Claudia Skok Gibbs,
Sangdoo Yun,
Hyun Oh Song,
Kyunghyun Cho
Abstract:
We propose a novel machine learning approach for inferring causal variables of a target variable from observations. Our focus is on directly inferring a set of causal factors without requiring full causal graph reconstruction, which is computationally challenging in large-scale systems. The identified causal set consists of all potential regulators of the target variable under experimental settings, enabling efficient regulation when intervention costs and feasibility vary across variables. To achieve this, we train a neural network using supervised learning on simulated data to infer causality. By employing a local-inference strategy, our approach scales with linear complexity in the number of variables, efficiently scaling up to thousands of variables. Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks, outperforming existing methods that emphasize full-graph discovery. We validate our model's generalization capability across out-of-distribution graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line. Implementation codes are available at https://github.com/snu-mllab/Targeted-Cause-Discovery.
Submitted 7 April, 2025; v1 submitted 28 August, 2024;
originally announced August 2024.
-
Diffusion-based Episodes Augmentation for Offline Multi-Agent Reinforcement Learning
Authors:
Jihwan Oh,
Sungnyun Kim,
Gahee Kim,
Sunghwan Kim,
Se-Young Yun
Abstract:
Offline multi-agent reinforcement learning (MARL) is increasingly recognized as crucial for effectively deploying RL algorithms in environments where real-time interaction is impractical, risky, or costly. In the offline setting, learning from a static dataset of past interactions allows for the development of robust and safe policies without the need for live data collection, which can be fraught with challenges. Building on this foundational importance, we present EAQ, Episodes Augmentation guided by Q-total loss, a novel approach for the offline MARL framework utilizing diffusion models. EAQ integrates the Q-total function directly into the diffusion model as guidance to maximize the global returns in an episode, eliminating the need for separate training. Our focus primarily lies on cooperative scenarios, where agents are required to act collectively towards achieving a shared goal, essentially maximizing global returns. Consequently, we demonstrate that our episodes augmentation in a collaborative manner significantly boosts offline MARL algorithms compared to the original dataset, improving the normalized return by +17.3% and +12.9% for medium and poor behavioral policies in the SMAC simulator, respectively.
Submitted 23 August, 2024;
originally announced August 2024.
-
Inching toward the QCD Axions with Axion Magnetic Resonance in Helioscopes
Authors:
Hyeonseok Seong,
Chen Sun,
Seokhoon Yun
Abstract:
Utilizing a helical magnet profile to enhance axion-photon conversion showed great promise in laboratory searches for high axion masses. We extend the mechanism, known as the axion-magnetic resonance (AMR), from laser experiments to axion helioscopes and demonstrate its potential in covering QCD axion parameter space. Specifically, we apply AMR to the CAST experiment legacy, make projections for the upcoming IAXO experiment, and assess its implications for both axion-like particles and QCD axions. We observe considerable improvement in the experiment's sensitivity reach in all cases.
Submitted 23 March, 2025; v1 submitted 20 August, 2024;
originally announced August 2024.
-
Tabular Transfer Learning via Prompting LLMs
Authors:
Jaehyun Nam,
Woomin Song,
Seong Hyeon Park,
Jihoon Tack,
Sukmin Yun,
Jaehyung Kim,
Kyu Hwan Oh,
Jinwoo Shin
Abstract:
Learning with a limited amount of labeled data is a central problem in real-world applications of machine learning, as it is often expensive to obtain annotations. To deal with the scarcity of labeled data, transfer learning is a conventional approach; it suggests learning transferable knowledge by training a neural network from multiple other sources. In this paper, we investigate transfer learning of tabular tasks, which has been less studied and less successful in the literature compared to other domains, e.g., vision and language. This is because tables are inherently heterogeneous, i.e., they contain different columns and feature spaces, making transfer learning difficult. On the other hand, recent advances in natural language processing suggest that the label scarcity issue can be mitigated by utilizing the in-context learning capability of large language models (LLMs). Inspired by this and the fact that LLMs can also process tables within a unified language space, we ask whether LLMs can be effective for tabular transfer learning, in particular, under scenarios where the source and target datasets are of different formats. As a positive answer, we propose a novel tabular transfer learning framework, coined Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with LLMs. Specifically, P2T identifies a column feature in a source dataset that is strongly correlated with a target task feature to create examples relevant to the target task, thus creating pseudo-demonstrations for prompts. Experimental results demonstrate that P2T outperforms previous methods on various tabular learning benchmarks, showing good promise for the important, yet underexplored tabular transfer learning problem. Code is available at https://github.com/jaehyun513/P2T.
Submitted 9 August, 2024;
originally announced August 2024.
-
Implicit Grid Convolution for Multi-Scale Image Super-Resolution
Authors:
Dongheon Lee,
Seokju Yun,
Youngmin Ro
Abstract:
For Image Super-Resolution (SR), it is common to train and evaluate scale-specific models composed of an encoder and upsampler for each targeted scale. Consequently, many SR studies encounter substantial training times and complex deployment requirements. In this paper, we address this limitation by training and evaluating multiple scales simultaneously. Notably, we observe that encoder features are similar across scales and that the Sub-Pixel Convolution (SPConv), a widely used scale-specific upsampler, exhibits strong inter-scale correlations in its functionality. Building on these insights, we propose a multi-scale framework that employs a single encoder in conjunction with Implicit Grid Convolution (IGConv), our novel upsampler, which unifies SPConv across all scales within a single module. Extensive experiments demonstrate that our framework achieves comparable performance to existing fixed-scale methods while reducing the training budget and stored parameters three-fold and maintaining the same latency. Additionally, we propose IGConv$^{+}$ to improve performance further by addressing spectral bias and allowing input-dependent upsampling and ensembled prediction. As a result, ATD-IGConv$^{+}$ achieves a notable 0.21dB improvement in PSNR on Urban100$\times$4, while also reducing the training budget, stored parameters, and inference cost compared to the existing ATD.
Submitted 15 November, 2024; v1 submitted 18 August, 2024;
originally announced August 2024.
-
Ill-posedness of the Boltzmann-BGK model in the exponential class
Authors:
Donghyun Lee,
Sungbin Park,
Seok-Bae Yun
Abstract:
The BGK (Bhatnagar-Gross-Krook) model is a relaxation-type model of the Boltzmann equation, which is widely used in place of the Boltzmann equation in physics and engineering. In this paper, we address the ill-posedness problem for the BGK model, in which the solution instantly escapes the initial solution space. For this, we propose two ill-posedness scenarios, namely, the homogeneous and the inhomogeneous ill-posedness mechanisms. In the former case, we find a class of spatially homogeneous solutions to the BGK model, where removing the small velocity part of the initial data triggers ill-posedness by increasing temperature. For the latter, we construct a spatially inhomogeneous solution to the BGK model such that the local temperature constructed from the solution has a polynomial growth in the spatial variable. These ill-posedness properties for the BGK model pose a stark contrast with the Boltzmann equation for which the solution map is, at least for a finite time, stable in the corresponding solution spaces.
Submitted 22 August, 2024; v1 submitted 15 August, 2024;
originally announced August 2024.
-
An Offline Meta Black-box Optimization Framework for Adaptive Design of Urban Traffic Light Management Systems
Authors:
Taeyoung Yun,
Kanghoon Lee,
Sujin Yun,
Ilmyung Kim,
Won-Woo Jung,
Min-Cheol Kwon,
Kyujin Choi,
Yoohyeon Lee,
Jinkyoo Park
Abstract:
Complex urban road networks with high vehicle occupancy frequently face severe traffic congestion. Designing an effective strategy for managing multiple traffic lights plays a crucial role in managing congestion. However, most current traffic light management systems rely on human-crafted decisions, which may not adapt well to diverse traffic patterns. In this paper, we delve into two pivotal design components of the traffic light management system that can be dynamically adjusted to various traffic conditions: phase combination and phase time allocation. While numerous studies have sought an efficient strategy for managing traffic lights, most of these approaches consider a fixed traffic pattern and are limited to relatively small road networks. To overcome these limitations, we introduce a novel and practical framework to formulate the optimization of such design components using an offline meta black-box optimization. We then present a simple yet effective method to efficiently find a solution for the aforementioned problem. In our framework, we first collect an offline meta dataset consisting of pairs of design choices and corresponding congestion measures from various traffic patterns. After collecting the dataset, we employ the Attentive Neural Process (ANP) to predict the impact of the proposed design on congestion across various traffic patterns with well-calibrated uncertainty. Finally, Bayesian optimization, with ANP as a surrogate model, is utilized to find an optimal design for unseen traffic patterns through limited online simulations. Our experiment results show that our method outperforms state-of-the-art baselines on complex road networks in terms of the number of waiting vehicles. Surprisingly, the deployment of our method into a real-world traffic system was able to improve traffic throughput by 4.80% compared to the original strategy.
Submitted 14 August, 2024;
originally announced August 2024.
-
VACoDe: Visual Augmented Contrastive Decoding
Authors:
Sihyeon Kim,
Boryeong Cho,
Sangmin Bae,
Sumyeong Ahn,
Se-Young Yun
Abstract:
Despite the astonishing performance of recent Large Vision-Language Models (LVLMs), these models often generate inaccurate responses. To address this issue, previous studies have focused on mitigating hallucinations by employing contrastive decoding (CD) with augmented images, which amplifies the contrast with the original image. However, these methods have limitations, including reliance on a single augmentation, which is restrictive for certain tasks, as well as the high cost of using external knowledge. In this study, we address these limitations by exploring how to utilize multiple image augmentations. Through extensive experiments, we observed that different augmentations produce varying levels of contrast depending on the task. Based on this observation, we introduce a novel method called VACoDe, Visual Augmented Contrastive Decoding. This method adaptively selects the augmentation with the highest contrast for each task using the proposed softmax distance metric. Our empirical tests show that VACoDe outperforms previous methods and improves output quality in various vision-language tasks. Additionally, VACoDe can be universally applied across different model types and sizes without additional training or the use of external models and data.
Submitted 26 July, 2024;
originally announced August 2024.
-
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
Authors:
Yong-Hyun Park,
Sangdoo Yun,
Jin-Hwa Kim,
Junho Kim,
Geonhui Jang,
Yonghyun Jeong,
Junghyo Jo,
Gayoung Lee
Abstract:
Recent advancements in text-to-image (T2I) models have unlocked a wide range of applications but also present significant risks, particularly in their potential to generate unsafe content. To mitigate this issue, researchers have developed unlearning techniques to remove the model's ability to generate potentially harmful content. However, these methods are easily bypassed by adversarial attacks, making them unreliable for ensuring the safety of generated images. In this paper, we propose Direct Unlearning Optimization (DUO), a novel framework for removing Not Safe For Work (NSFW) content from T2I models while preserving their performance on unrelated topics. DUO employs a preference optimization approach using curated paired image data, ensuring that the model learns to remove unsafe visual concepts while retaining unrelated features. Furthermore, we introduce an output-preserving regularization term to maintain the model's generative capabilities on safe content. Extensive experiments demonstrate that DUO can robustly defend against various state-of-the-art red teaming methods without significant performance degradation on unrelated topics, as measured by FID and CLIP scores. Our work contributes to the development of safer and more reliable T2I models, paving the way for their responsible deployment in both closed-source and open-source scenarios.
Submitted 16 January, 2025; v1 submitted 17 July, 2024;
originally announced July 2024.
-
Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network
Authors:
Sukwon Yun,
Jie Peng,
Alexandro E. Trevino,
Chanyoung Park,
Tianlong Chen
Abstract:
Recent advancements in graph-based approaches for multiplexed immunofluorescence (mIF) images have significantly propelled the field forward, offering deeper insights into patient-level phenotyping. However, current graph-based methodologies encounter two primary challenges: (1) Cellular Heterogeneity, where existing approaches fail to adequately address the inductive biases inherent in graphs, particularly the homophily characteristic observed in cellular connectivity; and (2) Scalability, where handling cellular graphs from high-dimensional images faces difficulties in managing a high number of cells. To overcome these limitations, we introduce Mew, a novel framework designed to efficiently process mIF images through the lens of a multiplex network. Mew innovatively constructs a multiplex network comprising two distinct layers: a Voronoi network for geometric information and a Cell-type network for capturing cell-wise homogeneity. This framework equips a scalable and efficient Graph Neural Network (GNN), capable of processing the entire graph during training. Furthermore, Mew integrates an interpretable attention module that autonomously identifies relevant layers for image classification. Extensive experiments on a real-world patient dataset from various institutions highlight Mew's remarkable efficacy and efficiency, marking a significant advancement in mIF image analysis. The source code of Mew can be found here: https://github.com/UNITES-Lab/Mew
Submitted 25 July, 2024;
originally announced July 2024.
-
A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits
Authors:
Junghyun Lee,
Se-Young Yun,
Kwang-Sung Jun
Abstract:
We present a unified likelihood ratio-based confidence sequence (CS) for any (self-concordant) generalized linear model (GLM) that is guaranteed to be convex and numerically tight. We show that this is on par or improves upon known CSs for various GLMs, including Gaussian, Bernoulli, and Poisson. In particular, for the first time, our CS for Bernoulli has a $\mathrm{poly}(S)$-free radius where $S$ is the norm of the unknown parameter. Our first technical novelty is its derivation, which utilizes a time-uniform PAC-Bayesian bound with a uniform prior/posterior, despite the latter being a rather unpopular choice for deriving CSs. As a direct application of our new CS, we propose a simple and natural optimistic algorithm called OFUGLB, applicable to any generalized linear bandits (GLB; Filippi et al. (2010)). Our analysis shows that the celebrated optimistic approach simultaneously attains state-of-the-art regrets for various self-concordant (not necessarily bounded) GLBs, and even $\mathrm{poly}(S)$-free for bounded GLBs, including logistic bandits. The regret analysis, our second technical novelty, follows from combining our new CS with a new proof technique that completely avoids the previously widely used self-concordant control lemma (Faury et al., 2020, Lemma 9). Numerically, OFUGLB outperforms or is at par with prior algorithms for logistic bandits.
Submitted 15 January, 2025; v1 submitted 18 July, 2024;
originally announced July 2024.
-
Enhancing Temporal Action Localization: Advanced S6 Modeling with Recurrent Mechanism
Authors:
Sangyoun Lee,
Juho Jung,
Changdae Oh,
Sunghee Yun
Abstract:
Temporal Action Localization (TAL) is a critical task in video analysis, identifying precise start and end times of actions. Existing methods like CNNs, RNNs, GCNs, and Transformers have limitations in capturing long-range dependencies and temporal causality. To address these challenges, we propose a novel TAL architecture leveraging the Selective State Space Model (S6). Our approach integrates the Feature Aggregated Bi-S6 block, Dual Bi-S6 structure, and a recurrent mechanism to enhance temporal and channel-wise dependency modeling without increasing parameter complexity. Extensive experiments on benchmark datasets demonstrate state-of-the-art results with mAP scores of 74.2% on THUMOS-14, 42.9% on ActivityNet, 29.6% on FineAction, and 45.8% on HACS. Ablation studies validate our method's effectiveness, showing that the Dual structure in the Stem module and the recurrent mechanism outperform traditional approaches. Our findings demonstrate the potential of S6-based models in TAL tasks, paving the way for future research.
Submitted 17 July, 2024;
originally announced July 2024.
-
Feature Diversification and Adaptation for Federated Domain Generalization
Authors:
Seunghan Yang,
Seokeon Choi,
Hyunsin Park,
Sungha Choi,
Simyung Chang,
Sungrack Yun
Abstract:
Federated learning, a distributed learning paradigm, utilizes multiple clients to build a robust global model. In real-world applications, local clients often operate within their limited domains, leading to a 'domain shift' across clients. Privacy concerns limit each client's learning to its own domain data, which increases the risk of overfitting. Moreover, the process of aggregating models trained on their own limited domains can potentially lead to a significant degradation in the global model performance. To deal with these challenges, we introduce the concept of federated feature diversification. Each client diversifies its own limited domain data by leveraging global feature statistics, i.e., the aggregated average statistics over all participating clients, shared through the global model's parameters. This data diversification helps local models to learn client-invariant representations while preserving privacy. Our resultant global model shows robust performance on unseen test domain data. To enhance performance further, we develop an instance-adaptive inference approach tailored for test domain data. Our proposed instance feature adapter dynamically adjusts feature statistics to align with the test input, thereby reducing the domain gap between the test and training domains. We show that our method achieves state-of-the-art performance on several domain generalization benchmarks within a federated learning setting.
Submitted 11 July, 2024;
originally announced July 2024.
-
Investigating User Perceptions of Collaborative Agenda Setting in Virtual Health Counseling Session
Authors:
Mina Fallah,
Farnaz Nouraei,
Hye Sun Yun,
Timothy Bickmore
Abstract:
Virtual health counselors offer the potential to provide users with information and counseling in complex areas such as disease management and health education. However, ensuring user engagement is challenging, particularly when the volume of information and length of counseling sessions increase. Agenda setting, a clinical counseling technique where a patient and clinician collaboratively decide on session topics, is an effective approach to tailoring discussions for individual patient needs and sustaining engagement. We explore the effectiveness of agenda setting in a virtual counselor system designed to counsel women for breast cancer genetic testing. In a between-subjects study, we assessed three versions of the system with varying levels of user control in the system's agenda setting approach. We found that participants' knowledge improved across all conditions. Although our results showed that any type of agenda setting was perceived as useful, regardless of user control, interviews revealed a preference for more collaboration and user involvement in the agenda setting process. Our study highlights the importance of using patient-centered approaches, such as tailored discussions, when using virtual counselors in healthcare.
Submitted 8 July, 2024;
originally announced July 2024.
-
Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
Authors:
Sungnyun Kim,
Kangwook Jang,
Sangmin Bae,
Hoirin Kim,
Se-Young Yun
Abstract:
Audio-visual speech recognition (AVSR) aims to transcribe human speech using both audio and video modalities. In practical environments with noise-corrupted audio, the role of video information becomes crucial. However, prior works have primarily focused on enhancing audio features in AVSR, overlooking the importance of video features. In this study, we strengthen the video features by learning three temporal dynamics in video data: context order, playback direction, and the speed of video frames. Cross-modal attention modules are introduced to enrich video features with audio information so that speech variability can be taken into account when training on the video temporal dynamics. With our approach, we achieve state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks in noise-dominant settings. Our approach excels especially under babble and speech noise, indicating an ability to distinguish the speech signal that should be recognized from lip movements in the video modality. We support the validity of our methodology with ablation experiments on the temporal dynamics losses and the cross-modal attention architecture design.
Submitted 14 October, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
ModelVerification.jl: a Comprehensive Toolbox for Formally Verifying Deep Neural Networks
Authors:
Tianhao Wei,
Luca Marzari,
Kai S. Yun,
Hanjiang Hu,
Peizhi Niu,
Xusheng Luo,
Changliu Liu
Abstract:
Deep Neural Networks (DNNs) are crucial in approximating nonlinear functions across diverse applications, ranging from image classification to control. Verifying specific input-output properties can be a highly challenging task due to the lack of a single, self-contained framework that supports a complete range of verification types. To this end, we present ModelVerification.jl (MV), the first comprehensive, cutting-edge toolbox that contains a suite of state-of-the-art methods for verifying different types of DNNs and safety specifications. This versatile toolbox is designed to empower developers and machine learning practitioners with robust tools for verifying and ensuring the trustworthiness of their DNN models.
Submitted 30 June, 2024;
originally announced July 2024.