-
Subspace-based Approximate Hessian Method for Zeroth-Order Optimization
Authors:
Dongyoon Kim,
Sungjae Lee,
Wonjin Lee,
Kwang In Kim
Abstract:
Zeroth-order optimization addresses problems where gradient information is inaccessible or impractical to compute. While most existing methods rely on first-order approximations, incorporating second-order (curvature) information can, in principle, significantly accelerate convergence. However, the high cost of function evaluations required to estimate Hessian matrices often limits practical appli…
▽ More
Zeroth-order optimization addresses problems where gradient information is inaccessible or impractical to compute. While most existing methods rely on first-order approximations, incorporating second-order (curvature) information can, in principle, significantly accelerate convergence. However, the high cost of function evaluations required to estimate Hessian matrices often limits practical applicability. We present the subspace-based approximate Hessian (ZO-SAH) method, a zeroth-order optimization algorithm that mitigates these costs by focusing on randomly selected two-dimensional subspaces. Within each subspace, ZO-SAH estimates the Hessian by fitting a quadratic polynomial to the objective function and extracting its second-order coefficients. To further reduce function-query costs, ZO-SAH employs a periodic subspace-switching strategy that reuses function evaluations across optimization steps. Experiments on eight benchmark datasets, including logistic regression and deep neural network training tasks, demonstrate that ZO-SAH achieves significantly faster convergence than existing zeroth-order methods.
△ Less
Submitted 8 July, 2025;
originally announced July 2025.
-
Integrating Generative AI in BIM Education: Insights from Classroom Implementation
Authors:
Islem Sahraoui,
Kinam Kim,
Lu Gao,
Zia Din,
Ahmed Senouci
Abstract:
This study evaluates the implementation of a Generative AI-powered rule checking workflow within a graduate-level Building Information Modeling (BIM) course at a U.S. university. Over two semesters, 55 students participated in a classroom-based pilot exploring the use of GenAI for BIM compliance tasks, an area with limited prior research. The instructional design included lectures on prompt engine…
▽ More
This study evaluates the implementation of a Generative AI-powered rule checking workflow within a graduate-level Building Information Modeling (BIM) course at a U.S. university. Over two semesters, 55 students participated in a classroom-based pilot exploring the use of GenAI for BIM compliance tasks, an area with limited prior research. The instructional design included lectures on prompt engineering and AI-driven rule checking, followed by an assignment where students used a large language model (LLM) to identify code violations in designs using Autodesk Revit. Surveys and interviews were conducted to assess student workload, learning effectiveness, and overall experience, using the NASA-TLX scale and regression analysis. Findings indicate students generally achieved learning objectives but faced challenges such as difficulties debugging AI-generated code and inconsistent tool performance, probably due to their limited prompt engineering experience. These issues increased cognitive and emotional strain, especially among students with minimal programming backgrounds. Despite these challenges, students expressed strong interest in future GenAI applications, particularly with clear instructional support.
△ Less
Submitted 5 July, 2025;
originally announced July 2025.
-
Recovering Plasticity of Neural Networks via Soft Weight Rescaling
Authors:
Seungwon Oh,
Sangyeon Park,
Isaac Han,
Kyung-Joong Kim
Abstract:
Recent studies have shown that as training progresses, neural networks gradually lose their capacity to learn new information, a phenomenon known as plasticity loss. An unbounded weight growth is one of the main causes of plasticity loss. Furthermore, it harms generalization capability and disrupts optimization dynamics. Re-initializing the network can be a solution, but it results in the loss of…
▽ More
Recent studies have shown that as training progresses, neural networks gradually lose their capacity to learn new information, a phenomenon known as plasticity loss. An unbounded weight growth is one of the main causes of plasticity loss. Furthermore, it harms generalization capability and disrupts optimization dynamics. Re-initializing the network can be a solution, but it results in the loss of learned information, leading to performance drops. In this paper, we propose Soft Weight Rescaling (SWR), a novel approach that prevents unbounded weight growth without losing information. SWR recovers the plasticity of the network by simply scaling down the weight at each step of the learning process. We theoretically prove that SWR bounds weight magnitude and balances weight magnitude between layers. Our experiment shows that SWR improves performance on warm-start learning, continual learning, and single-task learning setups on standard image classification benchmarks.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
Large Language Models' Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products
Authors:
Soojong Kim,
Kwanho Kim,
Hye Min Kim
Abstract:
Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. Especially, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on…
▽ More
Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. Especially, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. Thus, this research examines how accurate the three prominent LLMs (GPT, Gemini, and LLAMA) are in detecting risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, models often show higher accuracy for risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are more accurately detected. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.
△ Less
Submitted 6 July, 2025;
originally announced July 2025.
-
Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition
Authors:
Kyuhee Kim,
Sangah Lee
Abstract:
As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs' cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropria…
▽ More
As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs' cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs' cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.
△ Less
Submitted 5 July, 2025;
originally announced July 2025.
-
Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset
Authors:
Seunguk Yu,
Kyeonghyun Kim,
Jungmin Yun,
Youngbin Kim
Abstract:
Although LLMs have made significant progress in various languages, there are still concerns about their effectiveness with low-resource agglutinative languages compared to languages such as English. In this study, we focused on Korean, a language known for its complex sentence endings, and evaluated LLMs on this challenging aspect. We introduce the Korean Sentence Endings (KoSEnd) dataset, which i…
▽ More
Although LLMs have made significant progress in various languages, there are still concerns about their effectiveness with low-resource agglutinative languages compared to languages such as English. In this study, we focused on Korean, a language known for its complex sentence endings, and evaluated LLMs on this challenging aspect. We introduce the Korean Sentence Endings (KoSEnd) dataset, which includes 3,000 sentences, each annotated for the naturalness of 15 sentence ending forms. These were collected from diverse sources to cover a range of contexts. We evaluated 11 LLMs to assess their understanding of Korean sentence endings, analyzing them based on parameter count and prediction consistency. Notably, we found that informing models about the possibility of missing sentence endings improved performance, highlighting the impact of explicitly considering certain linguistic features.
△ Less
Submitted 4 July, 2025;
originally announced July 2025.
-
APT: Adaptive Personalized Training for Diffusion Models with Limited Data
Authors:
JungWoo Chae,
Jiyoon Kim,
JaeWoong Choi,
Kyungyul Kim,
Sangheum Hwang
Abstract:
Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework…
▽ More
Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model's internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2)Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
△ Less
Submitted 3 July, 2025;
originally announced July 2025.
-
A Miniature High-Resolution Tension Sensor Based on a Photo-Reflector for Robotic Hands and Grippers
Authors:
Hyun-Bin Kim,
Kyung-Soo Kim
Abstract:
This paper presents a miniature tension sensor using a photo-reflector, designed for compact tendon-driven grippers and robotic hands. The proposed sensor has a small form factor of 13~mm x 7~mm x 6.5~mm and is capable of measuring tensile forces up to 200~N. A symmetric elastomer structure incorporating fillets and flexure hinges is designed based on Timoshenko beam theory and verified via FEM an…
▽ More
This paper presents a miniature tension sensor using a photo-reflector, designed for compact tendon-driven grippers and robotic hands. The proposed sensor has a small form factor of 13~mm x 7~mm x 6.5~mm and is capable of measuring tensile forces up to 200~N. A symmetric elastomer structure incorporating fillets and flexure hinges is designed based on Timoshenko beam theory and verified via FEM analysis, enabling improved sensitivity and mechanical durability while minimizing torsional deformation. The sensor utilizes a compact photo-reflector (VCNT2020) to measure displacement in the near-field region, eliminating the need for light-absorbing materials or geometric modifications required in photo-interrupter-based designs. A 16-bit analog-to-digital converter (ADC) and CAN-FD (Flexible Data-rate) communication enable efficient signal acquisition with up to 5~kHz sampling rate. Calibration experiments demonstrate a resolution of 9.9~mN (corresponding to over 14-bit accuracy) and a root mean square error (RMSE) of 0.455~N. Force control experiments using a twisted string actuator and PI control yield RMSEs as low as 0.073~N. Compared to previous research using photo-interrupter, the proposed method achieves more than tenfold improvement in resolution while also reducing nonlinearity and hysteresis. The design is mechanically simple, lightweight, easy to assemble, and suitable for integration into robotic and prosthetic systems requiring high-resolution force feedback.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
Authors:
Kyochul Jang,
Donghyeon Lee,
Kyusik Kim,
Dongseok Heo,
Taewhoo Lee,
Woojeong Kim,
Bongwon Suh
Abstract:
Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through…
▽ More
Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: https://snuhcc.github.io/DICE-Bench/.
△ Less
Submitted 2 July, 2025; v1 submitted 28 June, 2025;
originally announced June 2025.
-
Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4
Authors:
Jongyeon Park,
Joonhee Lee,
Do-Hyeon Lim,
Hong Kook Kim,
Hyeongcheol Geum,
Jeong Eun Lim
Abstract:
This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to im-prove the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is…
▽ More
This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to im-prove the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alterna-tive perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classi-fication accuracy of low-performing classes by removing irrele-vant samples and incorporating external data. That is, audio mix-tures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submit-ted systems employing these approaches relatively improve CA-SDRi by up to 14.7% compared to the baseline of DCASE 2025 Challenge Task 4.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
Large Language Model-Driven Code Compliance Checking in Building Information Modeling
Authors:
Soumya Madireddy,
Lu Gao,
Zia Din,
Kinam Kim,
Ahmed Senouci,
Zhe Han,
Yunpeng Zhang
Abstract:
This research addresses the time-consuming and error-prone nature of manual code compliance checking in Building Information Modeling (BIM) by introducing a Large Language Model (LLM)-driven approach to semi-automate this critical process. The developed system integrates LLMs such as GPT, Claude, Gemini, and Llama, with Revit software to interpret building codes, generate Python scripts, and perfo…
▽ More
This research addresses the time-consuming and error-prone nature of manual code compliance checking in Building Information Modeling (BIM) by introducing a Large Language Model (LLM)-driven approach to semi-automate this critical process. The developed system integrates LLMs such as GPT, Claude, Gemini, and Llama, with Revit software to interpret building codes, generate Python scripts, and perform semi-automated compliance checks within the BIM environment. Case studies on a single-family residential project and an office building project demonstrated the system's ability to reduce the time and effort required for compliance checks while improving accuracy. It streamlined the identification of violations, such as non-compliant room dimensions, material usage, and object placements, by automatically assessing relationships and generating actionable reports. Compared to manual methods, the system eliminated repetitive tasks, simplified complex regulations, and ensured reliable adherence to standards. By offering a comprehensive, adaptable, and cost-effective solution, this proposed approach offers a promising advancement in BIM-based compliance checking, with potential applications across diverse regulatory documents in construction projects.
△ Less
Submitted 25 June, 2025;
originally announced June 2025.
-
Fair Bayesian Model-Based Clustering
Authors:
Jihu Lee,
Kunwoong Kim,
Yongdai Kim
Abstract:
Fair clustering has become a socially significant task with the advancement of machine learning technologies and the growing demand for trustworthy AI. Group fairness ensures that the proportions of each sensitive group are similar in all clusters. Most existing group-fair clustering methods are based on the $K$-means clustering and thus require the distance between instances and the number of clu…
▽ More
Fair clustering has become a socially significant task with the advancement of machine learning technologies and the growing demand for trustworthy AI. Group fairness ensures that the proportions of each sensitive group are similar in all clusters. Most existing group-fair clustering methods are based on the $K$-means clustering and thus require the distance between instances and the number of clusters to be given in advance. To resolve this limitation, we propose a fair Bayesian model-based clustering called Fair Bayesian Clustering (FBC). We develop a specially designed prior which puts its mass only on fair clusters, and implement an efficient MCMC algorithm. Advantages of FBC are that it can infer the number of clusters and can be applied to any data type as long as the likelihood is defined (e.g., categorical data). Experiments on real-world datasets show that FBC (i) reasonably infers the number of clusters, (ii) achieves a competitive utility-fairness trade-off compared to existing fair clustering methods, and (iii) performs well on categorical data.
△ Less
Submitted 15 June, 2025;
originally announced June 2025.
-
Collaborative Prediction: To Join or To Disjoin Datasets
Authors:
Kyung Rok Kim,
Yansong Wang,
Xiaocheng Li,
Guanting Chen
Abstract:
With the recent rise of generative Artificial Intelligence (AI), the need of selecting high-quality dataset to improve machine learning models has garnered increasing attention. However, some part of this topic remains underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate dataset to minimize population loss o…
▽ More
With the recent rise of generative Artificial Intelligence (AI), the need of selecting high-quality dataset to improve machine learning models has garnered increasing attention. However, some part of this topic remains underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate dataset to minimize population loss of our prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model's performance, and propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader machine learning applications. Code is available at https://github.com/kkrokii/collaborative_prediction.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models
Authors:
Sungwon Hwang,
Hyojin Jang,
Kinam Kim,
Minho Park,
Jaegul Choo
Abstract:
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimila…
▽ More
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, its internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability.
△ Less
Submitted 25 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models
Authors:
Kyeonghyun Kim,
Jinhee Jang,
Juhwan Choi,
Yoonji Lee,
Kyohoon Jin,
YoungBin Kim
Abstract:
Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small language models (SLMs) are computationally efficient but often lack the broad generalization capacity of LLMs. To bridge this gap, we propose PiFi, a novel frame…
▽ More
Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small language models (SLMs) are computationally efficient but often lack the broad generalization capacity of LLMs. To bridge this gap, we propose PiFi, a novel framework that combines the strengths of both LLMs and SLMs to achieve high performance while maintaining efficiency. PiFi integrates a single frozen layer from an LLM into a SLM and fine-tunes the combined model for specific tasks, boosting performance without a significant increase in computational cost. We show that PiFi delivers consistent performance improvements across a range of natural language processing tasks, including both natural language understanding and generation. Moreover, our findings demonstrate PiFi's ability to effectively leverage LLM knowledge, enhancing generalization to unseen domains and facilitating the transfer of linguistic abilities.
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
Population-Proportional Preference Learning from Human Feedback: An Axiomatic Approach
Authors:
Kihyun Kim,
Jiawei Zhang,
Asuman Ozdaglar,
Pablo A. Parrilo
Abstract:
Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups. The objective of this paper is to develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population…
▽ More
Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups. The objective of this paper is to develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly-introduced axioms of population-proportional representation and population-bounded robustness. We propose a soft-max relaxation method that smoothly trade-offs population-proportional representation with the selection of the Condorcet winner (which beats all other options in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large-scale language model alignment.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions
Authors:
Jeongsoo Ha,
Kyungsoo Kim,
Yusung Kim
Abstract:
Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks in highdimensional image observations. Although recent MBRL algorithms perform well in trained observations, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In…
▽ More
Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks in highdimensional image observations. Although recent MBRL algorithms perform well in trained observations, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In this study, we propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite, and the randomizing environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open-sourced and available at https://github.com/JeongsooHa/DrG.git
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Self-Predictive Dynamics for Generalization of Vision-based Reinforcement Learning
Authors:
Kyungsoo Kim,
Jeongsoo Ha,
Yusung Kim
Abstract:
Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and light. It becomes more important if those distractions are not exposed during training. We design a Self-Predictive Dynamics (SPD) method to extract task-relevant features efficiently, e…
▽ More
Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and light. It becomes more important if those distractions are not exposed during training. We design a Self-Predictive Dynamics (SPD) method to extract task-relevant features efficiently, even in unseen observations after training. SPD uses weak and strong augmentations in parallel, and learns representations by predicting inverse and forward transitions across the two-way augmented versions. In a set of MuJoCo visual control tasks and an autonomous driving task (CARLA), SPD outperforms previous studies in complex observations, and significantly improves the generalization performance for unseen observations. Our code is available at https://github.com/unigary/SPD.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Automatically Detecting Amusing Games in Wordle
Authors:
Ronaldo Luo,
Gary Liang,
Cindy Liu,
Adam Kabbara,
Minahil Bakhtawar,
Kina Kim,
Michael Guerzhoy
Abstract:
We explore automatically predicting which Wordle games Reddit users find amusing.
We scrape approximately 80k reactions by Reddit users to Wordle games from Reddit, classify the reactions as expressing amusement or not using OpenAI's GPT-3.5 using few-shot prompting, and verify that GPT-3.5's labels roughly correspond to human labels.
We then extract features from Wordle games that can predict…
▽ More
We explore automatically predicting which Wordle games Reddit users find amusing.
We scrape approximately 80k reactions by Reddit users to Wordle games from Reddit, classify the reactions as expressing amusement or not using OpenAI's GPT-3.5 using few-shot prompting, and verify that GPT-3.5's labels roughly correspond to human labels.
We then extract features from Wordle games that can predict user amusement. We demonstrate that the features indeed provide a (weak) signal that predicts user amusement as predicted by GPT-3.5.
Our results indicate that user amusement at Wordle games can be predicted computationally to some extent. We explore which features of the game contribute to user amusement.
We find that user amusement is predictable, indicating a measurable aspect of creativity infused into Wordle games through humor.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model
Authors:
Youngwan Lee,
Kangsan Kim,
Kwanyong Park,
Ilcahe Jung,
Soojin Jang,
Seanie Lee,
Yong-Ju Lee,
Sung Ju Hwang
Abstract:
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in…
▽ More
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
△ Less
Submitted 11 June, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
Understanding the Impact of Sampling Quality in Direct Preference Optimization
Authors:
Kyung Rok Kim,
Yumo Bai,
Chonghuan Wang,
Guanting Chen
Abstract:
We study the role of the sampling distribution in Direct Preference Optimization (DPO) and aim to understand its impact on DPO's training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the generating distribution. We first analyze how distribution of responses influences policy updates during gradient descent, drawi…
▽ More
We study the role of the sampling distribution in Direct Preference Optimization (DPO) and aim to understand its impact on DPO's training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the generating distribution. We first analyze how distribution of responses influences policy updates during gradient descent, drawing connections to common phenomena found in practice. We then design a simplified yet well-structured alignment model as a proxy, and develop quantitative results showing how more frequent high-quality responses amplify the gradient signal and improve the optimization landscape, leading to more effective policy learning. Our theoretical findings are supported by empirical experiments and provide a principled justification for the online DPO framework in practice.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report
Authors:
Marcos V. Conde,
Radu Timofte,
Radu Berdan,
Beril Besbinar,
Daisuke Iso,
Pengzhou Ji,
Xiong Dun,
Zeying Fan,
Chen Wu,
Zhansheng Wang,
Pengbo Zhang,
Jiazi Huang,
Qinglin Liu,
Wei Yu,
Shengping Zhang,
Xiangyang Ji,
Kyungsik Kim,
Minkyung Kim,
Hwalmin Lee,
Hekun Ma,
Huan Zheng,
Yanyan Wei,
Zhao Zhang,
Jing Fang,
Meilin Gao
, et al. (8 additional authors not shown)
Abstract:
Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW…
▽ More
Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW Reconstruction from sRGB (Reverse ISP). We aim to recover RAW sensor images from smartphones given the corresponding sRGB images without metadata and, by doing this, ``reverse" the ISP transformation. Over 150 participants joined this NTIRE 2025 challenge and submitted efficient models. The proposed methods and benchmark establish the state-of-the-art for generating realistic RAW data.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Self-Refining Language Model Anonymizers via Adversarial Distillation
Authors:
Kyuyoung Kim,
Hyunjun Jeon,
Jinwoo Shin
Abstract:
Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text poses emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems.…
▽ More
Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text poses emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems. To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external costly models at inference time. We leverage adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are used to distill anonymization, adversarial inference, and utility evaluation capabilities into SLMs via supervised fine-tuning and preference learning. The resulting models learn to both anonymize text and critique their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy. These results show the effectiveness of our adversarial distillation framework in training SLMs as efficient anonymizers. To facilitate further research, we release the full dataset used in our experiments.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases
Authors:
Dongwon Choi,
Sunwoo Kim,
Juyeon Kim,
Kyungho Kim,
Geon Lee,
Shinhwan Kang,
Myunghwan Kim,
Kijung Shin
Abstract:
Relational databases (RDBs) are composed of interconnected tables, where relationships between them are defined through foreign keys. Recent research on applying machine learning to RDBs has explored graph-based representations of RDBs, where rows of tables are modeled as nodes, and foreign key relationships are modeled as edges. RDB-to-graph modeling helps capture cross-table dependencies, ultima…
▽ More
Relational databases (RDBs) are composed of interconnected tables, where relationships between them are defined through foreign keys. Recent research on applying machine learning to RDBs has explored graph-based representations of RDBs, where rows of tables are modeled as nodes, and foreign key relationships are modeled as edges. RDB-to-graph modeling helps capture cross-table dependencies, ultimately leading to enhanced performance across diverse tasks. However, there are numerous ways to model RDBs as graphs, and performance varies significantly depending on the chosen graph model. In our analysis, applying a common heuristic rule for graph modeling leads to up to a 10% drop in performance compared to the best-performing graph model, which remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 9 automatic RDB-to-graph modeling methods on the 12 tasks over 600x faster than on-the-fly evaluation, which requires repeated model training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
An Empirical Study of Group Conformity in Multi-Agent Systems
Authors:
Min Choi,
Keonwoo Kim,
Sungwon Chae,
Sangyeob Baek
Abstract:
Recent advances in Large Language Models (LLMs) have enabled multi-agent systems that simulate real-world interactions with near-human reasoning. While previous studies have extensively examined biases related to protected attributes such as race, the emergence and propagation of biases on socially contentious issues in multi-agent LLM interactions remain underexplored. This study explores how LLM…
▽ More
Recent advances in Large Language Models (LLMs) have enabled multi-agent systems that simulate real-world interactions with near-human reasoning. While previous studies have extensively examined biases related to protected attributes such as race, the emergence and propagation of biases on socially contentious issues in multi-agent LLM interactions remain underexplored. This study explores how LLM agents shape public opinion through debates on five contentious topics. By simulating over 2,500 debates, we analyze how initially neutral agents, assigned a centrist disposition, adopt specific stances over time. Statistical analyses reveal significant group conformity mirroring human behavior; LLM agents tend to align with numerically dominant groups or more intelligent agents, exerting a greater influence. These findings underscore the crucial role of agent intelligence in shaping discourse and highlight the risks of bias amplification in online interactions. Our results emphasize the need for policy measures that promote diversity and transparency in LLM-generated discussions to mitigate the risks of bias propagation within anonymous online environments.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Mamba Drafters for Speculative Decoding
Authors:
Daewon Choi,
Seunghyuk Oh,
Saket Dingliwal,
Jihoon Tack,
Kyuyoung Kim,
Woomin Song,
Seojin Kim,
Insu Han,
Jinwoo Shin,
Aram Galstyan,
Shubham Katiyar,
Sravan Babu Bodapati
Abstract:
Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but requi…
▽ More
Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models
Authors:
Kinam Kim,
Junha Hyung,
Jaegul Choo
Abstract:
Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, l…
▽ More
Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
FedCostAware: Enabling Cost-Aware Federated Learning on the Cloud
Authors:
Aditya Sinha,
Zilinghan Li,
Tingkai Liu,
Volodymyr Kindratenko,
Kibaek Kim,
Ravi Madduri
Abstract:
Federated learning (FL) is a distributed machine learning (ML) approach that allows multiple clients to collaboratively train ML model without exchanging their original training data, offering a solution that is particularly valuable in sensitive domains such as biomedicine. However, training robust FL models often requires substantial computing resources from participating clients, such as GPUs,…
▽ More
Federated learning (FL) is a distributed machine learning (ML) approach that allows multiple clients to collaboratively train ML model without exchanging their original training data, offering a solution that is particularly valuable in sensitive domains such as biomedicine. However, training robust FL models often requires substantial computing resources from participating clients, such as GPUs, which may not be readily available at institutions such as hospitals. While cloud platforms (e.g., AWS) offer on-demand access to such resources, their usage can incur significant costs, particularly in distributed training scenarios where poor coordination strategies can lead to substantial resource wastage. To address this, we introduce FedCostAware, a cost-aware scheduling algorithm designed to optimize synchronous FL on cloud spot instances. FedCostAware addresses the challenges of training on spot instances and different client budgets by employing intelligent management of the lifecycle of spot instances. This approach minimizes resource idle time and overall expenses. Comprehensive experiments across multiple datasets demonstrate that FedCostAware significantly reduces cloud computing costs compared to conventional spot and on-demand schemes, enhancing the accessibility and affordability of FL.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational Inference
Authors:
Kyurae Kim,
Yi-An Ma,
Trevor Campbell,
Jacob R. Gardner
Abstract:
We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at an almost dimension-independent rate. Specifically, for strongly log-concave and log-smooth targets, the number of iterations for BBVI with a sub-Gaussian family to achieve an objective $ε$-close to the global optimum is $\mathrm{O}(\log d)$,…
▽ More
We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at an almost dimension-independent rate. Specifically, for strongly log-concave and log-smooth targets, the number of iterations for BBVI with a sub-Gaussian family to achieve an objective $ε$-close to the global optimum is $\mathrm{O}(\log d)$, which improves over the $\mathrm{O}(d)$ dependence of full-rank location-scale families. For heavy-tailed families, we provide a weaker $\mathrm{O}(d^{2/k})$ dimension dependence, where $k$ is the number of finite moments. Additionally, if the Hessian of the target log-density is constant, the complexity is free of any explicit dimension dependence. We also prove that our bound on the gradient variance, which is key to our result, cannot be improved using only spectral bounds on the Hessian of the target log-density.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
Authors:
Hee-Seon Kim,
Minbeom Kim,
Wonjun Lee,
Kihyun Kim,
Changick Kim
Abstract:
Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggle…
▽ More
Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation
Authors:
Sanghyun Jo,
Wooyeol Lee,
Ziseok Lee,
Kyungsu Kim
Abstract:
Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance format…
▽ More
Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
Authors:
Jiyoung Lee,
Seungho Kim,
Jieun Han,
Jun-Min Lee,
Kitaek Kim,
Alice Oh,
Edward Choi
Abstract:
Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-…
▽ More
Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code and datasets are publicly available at https://github.com/jiyounglee-0523/TransEnV and https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1.
△ Less
Submitted 4 June, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
Authors:
Xin Zhou,
Kisub Kim,
Ting Zhang,
Martin Weyssow,
Luis F. Gomes,
Guang Yang,
David Lo
Abstract:
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalabil…
▽ More
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, other existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts.
In this paper, we present SWE-Judge, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge's potential as a scalable and reliable alternative to human evaluation.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph
Authors:
Junsik Kim,
Jinwook Park,
Kangil Kim
Abstract:
In knowledge graph embedding, leveraging relation specific entity transformation has markedly enhanced performance. However, the consistency of embedding differences before and after transformation remains unaddressed, risking the loss of valuable inductive bias inherent in the embeddings. This inconsistency stems from two problems. First, transformation representations are specified for relations…
▽ More
In knowledge graph embedding, leveraging relation specific entity transformation has markedly enhanced performance. However, the consistency of embedding differences before and after transformation remains unaddressed, risking the loss of valuable inductive bias inherent in the embeddings. This inconsistency stems from two problems. First, transformation representations are specified for relations in a disconnected manner, allowing dissimilar transformations and corresponding entity embeddings for similar relations. Second, a generalized plug-in approach as a SFBR (Semantic Filter Based on Relations) disrupts this consistency through excessive concentration of entity embeddings under entity-based regularization, generating indistinguishable score distributions among relations. In this paper, we introduce a plug-in KGE method, Relation-Semantics Consistent Filter (RSCF). Its entity transformation has three features for enhancing semantic consistency: 1) shared affine transformation of relation embeddings across all relations, 2) rooted entity transformation that adds an entity embedding to its change represented by the transformed vector, and 3) normalization of the change to prevent scale reduction. To amplify the advantages of consistency that preserve semantics on embeddings, RSCF adds relation transformation and prediction modules for enhancing the semantics. In knowledge graph completion tasks with distance-based and tensor decomposition models, RSCF significantly outperforms state-of-the-art KGE methods, showing robustness across all relations and their frequencies.
△ Less
Submitted 12 June, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
HAND Me the Data: Fast Robot Adaptation via Hand Path Retrieval
Authors:
Matthew Hong,
Anthony Liang,
Kevin Kim,
Harshitha Rajaprakash,
Jesse Thomason,
Erdem Bıyık,
Jesse Zhang
Abstract:
We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion o…
▽ More
We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval/.
△ Less
Submitted 1 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking
Authors:
Cheng-Yen Yang,
Hsiang-Wei Huang,
Pyong-Kun Kim,
Chien-Kai Kuo,
Jui-Wei Chang,
Kwang-Ju Kim,
Chung-I Huang,
Jenq-Neng Hwang
Abstract:
We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first place AUC score of 89.4 on the 2024 ICPR Multi-mod…
▽ More
We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of VOT solutions along with the multi-modality aspect of the dataset.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions
Authors:
Daeun Kyung,
Hyunseung Chung,
Seongsu Bae,
Jiho Kim,
Jae Ho Sohn,
Taerim Kim,
Soo Kyung Kim,
Edward Choi
Abstract:
Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates rea…
▽ More
Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluated eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3, was validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model
Authors:
Kwanyoung Kim,
Sanghyun Kim
Abstract:
The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherent…
▽ More
The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherently preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency. For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that enables score estimation using a single diffusion step and a subset of attention layers. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality and temporal coherence with only an 8% and 13% increase in inference time, respectively, providing a principled and generalizable approach to noise selection in video diffusion. See our project page: https://anse-project.github.io/anse-project/
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding
Authors:
Yeongjae Cho,
Keonwoo Kim,
Taebaek Hwang,
Sungzoon Cho
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as d…
▽ More
Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation
Authors:
Uyoung Jeong,
Jonathan Freer,
Seungryul Baek,
Hyung Jin Chang,
Kwang In Kim
Abstract:
We study multi-dataset training (MDT) for pose estimation, where skeletal heterogeneity presents a unique challenge that existing methods have yet to address. In traditional domains, \eg regression and classification, MDT typically relies on dataset merging or multi-head supervision. However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estim…
▽ More
We study multi-dataset training (MDT) for pose estimation, where skeletal heterogeneity presents a unique challenge that existing methods have yet to address. In traditional domains, \eg regression and classification, MDT typically relies on dataset merging or multi-head supervision. However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estimation. To address these challenges, we introduce PoseBH, a new MDT framework that tackles keypoint heterogeneity and limited supervision through two key techniques. First, we propose nonparametric keypoint prototypes that learn within a unified embedding space, enabling seamless integration across skeleton types. Second, we develop a cross-type self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, providing supervision without relying on teacher-student models or additional augmentations. PoseBH substantially improves generalization across whole-body and animal pose datasets, including COCO-WholeBody, AP-10K, and APT-36K, while preserving performance on standard human pose benchmarks (COCO, MPII, and AIC). Furthermore, our learned keypoint embeddings transfer effectively to hand shape estimation (InterHand2.6M) and human body shape estimation (3DPW). The code for PoseBH is available at: https://github.com/uyoung-jeong/PoseBH.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Sobolev Gradient Ascent for Optimal Transport: Barycenter Optimization and Convergence Analysis
Authors:
Kaheon Kim,
Bohan Zhou,
Changbo Zhu,
Xiaohui Chen
Abstract:
This paper introduces a new constraint-free concave dual formulation for the Wasserstein barycenter. Tailoring the vanilla dual gradient ascent algorithm to the Sobolev geometry, we derive a scalable Sobolev gradient ascent (SGA) algorithm to compute the barycenter for input distributions supported on a regular grid. Despite the algorithmic simplicity, we provide a global convergence analysis that…
▽ More
This paper introduces a new constraint-free concave dual formulation for the Wasserstein barycenter. Tailoring the vanilla dual gradient ascent algorithm to the Sobolev geometry, we derive a scalable Sobolev gradient ascent (SGA) algorithm to compute the barycenter for input distributions supported on a regular grid. Despite the algorithmic simplicity, we provide a global convergence analysis that achieves the same rate as the classical subgradient descent methods for minimizing nonsmooth convex functions in the Euclidean space. A central feature of our SGA algorithm is that the computationally expensive $c$-concavity projection operator enforced on the Kantorovich dual potentials is unnecessary to guarantee convergence, leading to significant algorithmic and theoretical simplifications over all existing primal and dual methods for computing the exact barycenter. Our numerical experiments demonstrate the superior empirical performance of SGA over the existing optimal transport barycenter solvers.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Fair Clustering via Alignment
Authors:
Kunwoong Kim,
Jihu Lee,
Sangchul Park,
Yongdai Kim
Abstract:
Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To…
▽ More
Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To resolve these limitations, we propose a new fair clustering algorithm based on a novel decomposition of the fair $K$-means clustering objective function. The proposed algorithm, called Fair Clustering via Alignment (FCA), operates by alternately (i) finding a joint probability distribution to align the data from different protected groups, and (ii) optimizing cluster centers in the aligned space. A key advantage of FCA is that it theoretically guarantees approximately optimal clustering utility for any given fairness level without complex constraints, thereby enabling high-utility fair clustering in practice. Experiments show that FCA outperforms existing methods by (i) attaining a superior trade-off between fairness level and clustering utility, and (ii) achieving near-perfect fairness without numerical instability.
△ Less
Submitted 23 May, 2025; v1 submitted 14 May, 2025;
originally announced May 2025.
-
A Novel 6-axis Force/Torque Sensor Using Inductance Sensors
Authors:
Hyun-Bin Kim,
Kyung-Soo Kim
Abstract:
This paper presents a novel six-axis force/torque (F/T) sensor based on inductive sensing technology. Unlike conventional strain gauge-based sensors that require direct contact and external amplification, the proposed sensor utilizes non-contact inductive measurements to estimate force via displacement of a conductive target. A compact, fully integrated architecture is achieved by incorporating a…
▽ More
This paper presents a novel six-axis force/torque (F/T) sensor based on inductive sensing technology. Unlike conventional strain gauge-based sensors that require direct contact and external amplification, the proposed sensor utilizes non-contact inductive measurements to estimate force via displacement of a conductive target. A compact, fully integrated architecture is achieved by incorporating a CAN-FD based signal processing module directly onto the PCB, enabling high-speed data acquisition at up to 4~kHz without external DAQ systems. The sensing mechanism is modeled and calibrated through a rational function fitting approach, which demonstrated superior performance in terms of root mean square error (RMSE), coefficient of determination ($R^2$), and linearity error compared to other nonlinear models. Static and repeatability experiments validate the sensor's accuracy, achieving a resolution of 0.03~N and quantization levels exceeding 55,000 steps, surpassing that of commercial sensors. The sensor also exhibits low crosstalk, high sensitivity, and robust noise characteristics. Its performance and structure make it suitable for precision robotic applications, especially in scenarios where compactness, non-contact operation, and integrated processing are essential.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Fair Representation Learning for Continuous Sensitive Attributes using Expectation of Integral Probability Metrics
Authors:
Insung Kong,
Kunwoong Kim,
Yongdai Kim
Abstract:
AI fairness, also known as algorithmic fairness, aims to ensure that algorithms operate without bias or discrimination towards any individual or group. Among various AI algorithms, the Fair Representation Learning (FRL) approach has gained significant interest in recent years. However, existing FRL algorithms have a limitation: they are primarily designed for categorical sensitive attributes and t…
▽ More
AI fairness, also known as algorithmic fairness, aims to ensure that algorithms operate without bias or discrimination towards any individual or group. Among various AI algorithms, the Fair Representation Learning (FRL) approach has gained significant interest in recent years. However, existing FRL algorithms have a limitation: they are primarily designed for categorical sensitive attributes and thus cannot be applied to continuous sensitive attributes, such as age or income. In this paper, we propose an FRL algorithm for continuous sensitive attributes. First, we introduce a measure called the Expectation of Integral Probability Metrics (EIPM) to assess the fairness level of representation space for continuous sensitive attributes. We demonstrate that if the distribution of the representation has a low EIPM value, then any prediction head constructed on the top of the representation become fair, regardless of the selection of the prediction head. Furthermore, EIPM possesses a distinguished advantage in that it can be accurately estimated using our proposed estimator with finite samples. Based on these properties, we propose a new FRL algorithm called Fair Representation using EIPM with MMD (FREM). Experimental evidences show that FREM outperforms other baseline methods.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Tri-MTL: A Triple Multitask Learning Approach for Respiratory Disease Diagnosis
Authors:
June-Woo Kim,
Sanghoon Lee,
Miika Toikkanen,
Daehwan Hwang,
Kyunghoon Kim
Abstract:
Auscultation remains a cornerstone of clinical practice, essential for both initial evaluation and continuous monitoring. Clinicians listen to the lung sounds and make a diagnosis by combining the patient's medical history and test results. Given this strong association, multitask learning (MTL) can offer a compelling framework to simultaneously model these relationships, integrating respiratory s…
▽ More
Auscultation remains a cornerstone of clinical practice, essential for both initial evaluation and continuous monitoring. Clinicians listen to the lung sounds and make a diagnosis by combining the patient's medical history and test results. Given this strong association, multitask learning (MTL) can offer a compelling framework to simultaneously model these relationships, integrating respiratory sound patterns with disease manifestations. While MTL has shown considerable promise in medical applications, a significant research gap remains in understanding the complex interplay between respiratory sounds, disease manifestations, and patient metadata attributes. This study investigates how integrating MTL with cutting-edge deep learning architectures can enhance both respiratory sound classification and disease diagnosis. Specifically, we extend recent findings regarding the beneficial impact of metadata on respiratory sound classification by evaluating its effectiveness within an MTL framework. Our comprehensive experiments reveal significant improvements in both lung sound classification and diagnostic performance when the stethoscope information is incorporated into the MTL architecture.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Importance Analysis for Dynamic Control of Balancing Parameter in a Simple Knowledge Distillation Setting
Authors:
Seongmin Kim,
Kwanho Kim,
Minseung Kim,
Kanghyun Jo
Abstract:
Although deep learning models owe their remarkable success to deep and complex architectures, this very complexity typically comes at the expense of real-time performance. To address this issue, a variety of model compression techniques have been proposed, among which knowledge distillation (KD) stands out for its strong empirical performance. The KD contains two concurrent processes: (i) matching…
▽ More
Although deep learning models owe their remarkable success to deep and complex architectures, this very complexity typically comes at the expense of real-time performance. To address this issue, a variety of model compression techniques have been proposed, among which knowledge distillation (KD) stands out for its strong empirical performance. The KD contains two concurrent processes: (i) matching the outputs of a large, pre-trained teacher network and a lightweight student network, and (ii) training the student to solve its designated downstream task. The associated loss functions are termed the distillation loss and the downsteam-task loss, respectively. Numerous prior studies report that KD is most effective when the influence of the distillation loss outweighs that of the downstream-task loss. The influence(or importance) is typically regulated by a balancing parameter. This paper provides a mathematical rationale showing that in a simple KD setting when the loss is decreasing, the balancing parameter should be dynamically adjusted
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Real-Time Person Image Synthesis Using a Flow Matching Model
Authors:
Jiwoo Jeong,
Kirok Kim,
Wooju Kim,
Nam-Joon Kim
Abstract:
Pose-Guided Person Image Synthesis (PGPIS) generates realistic person images conditioned on a target pose and a source image. This task plays a key role in various real-world applications, such as sign language video generation, AR/VR, gaming, and live streaming. In these scenarios, real-time PGPIS is critical for providing immediate visual feedback and maintaining user immersion.However, achievin…
▽ More
Pose-Guided Person Image Synthesis (PGPIS) generates realistic person images conditioned on a target pose and a source image. This task plays a key role in various real-world applications, such as sign language video generation, AR/VR, gaming, and live streaming. In these scenarios, real-time PGPIS is critical for providing immediate visual feedback and maintaining user immersion.However, achieving real-time performance remains a significant challenge due to the complexity of synthesizing high-fidelity images from diverse and dynamic human poses. Recent diffusion-based methods have shown impressive image quality in PGPIS, but their slow sampling speeds hinder deployment in time-sensitive applications. This latency is particularly problematic in tasks like generating sign language videos during live broadcasts, where rapid image updates are required. Therefore, developing a fast and reliable PGPIS model is a crucial step toward enabling real-time interactive systems. To address this challenge, we propose a generative model based on flow matching (FM). Our approach enables faster, more stable, and more efficient training and sampling. Furthermore, the proposed model supports conditional generation and can operate in latent space, making it especially suitable for real-time PGPIS applications where both speed and quality are critical. We evaluate our proposed method, Real-Time Person Image Synthesis Using a Flow Matching Model (RPFM), on the widely used DeepFashion dataset for PGPIS tasks. Our results show that RPFM achieves near-real-time sampling speeds while maintaining performance comparable to the state-of-the-art models. Our methodology trades off a slight, acceptable decrease in generated-image accuracy for over a twofold increase in generation speed, thereby ensuring real-time performance.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Artificial Behavior Intelligence: Technology, Challenges, and Future Directions
Authors:
Kanghyun Jo,
Jehwan Choi,
Kwanho Kim,
Seongmin Kim,
Duy-Linh Nguyen,
Xuan-Thuy Vo,
Adri Priadana,
Tien-Dat Tran
Abstract:
Understanding and predicting human behavior has emerged as a core capability in various AI application domains such as autonomous driving, smart healthcare, surveillance systems, and social robotics. This paper defines the technical framework of Artificial Behavior Intelligence (ABI), which comprehensively analyzes and interprets human posture, facial expressions, emotions, behavioral sequences, a…
▽ More
Understanding and predicting human behavior has emerged as a core capability in various AI application domains such as autonomous driving, smart healthcare, surveillance systems, and social robotics. This paper defines the technical framework of Artificial Behavior Intelligence (ABI), which comprehensively analyzes and interprets human posture, facial expressions, emotions, behavioral sequences, and contextual cues. It details the essential components of ABI, including pose estimation, face and emotion recognition, sequential behavior analysis, and context-aware modeling. Furthermore, we highlight the transformative potential of recent advances in large-scale pretrained models, such as large language models (LLMs), vision foundation models, and multimodal integration models, in significantly improving the accuracy and interpretability of behavior recognition. Our research team has a strong interest in the ABI domain and is actively conducting research, particularly focusing on the development of intelligent lightweight models capable of efficiently inferring complex human behaviors. This paper identifies several technical challenges that must be addressed to deploy ABI in real-world applications including learning behavioral intelligence from limited data, quantifying uncertainty in complex behavior prediction, and optimizing model structures for low-power, real-time inference. To tackle these challenges, our team is exploring various optimization strategies including lightweight transformers, graph-based recognition architectures, energy-aware loss functions, and multimodal knowledge distillation, while validating their applicability in real-time environments.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Refining Fuzzed Crashing Inputs for Better Fault Diagnosis
Authors:
Kieun Kim,
Seongmin Lee,
Shin Hong
Abstract:
We present DiffMin, a technique that refines a fuzzed crashing input to gain greater similarities to given passing inputs to help developers analyze the crashing input to identify the failure-inducing condition and locate buggy code for debugging. DiffMin iteratively applies edit actions to transform a fuzzed input while preserving the crash behavior. Our pilot study with the Magma benchmark demon…
▽ More
We present DiffMin, a technique that refines a fuzzed crashing input to gain greater similarities to given passing inputs to help developers analyze the crashing input to identify the failure-inducing condition and locate buggy code for debugging. DiffMin iteratively applies edit actions to transform a fuzzed input while preserving the crash behavior. Our pilot study with the Magma benchmark demonstrates that DiffMin effectively minimizes the differences between crashing and passing inputs while enhancing the accuracy of spectrum-based fault localization, highlighting its potential as a valuable pre-debugging step after greybox fuzzing.
△ Less
Submitted 6 May, 2025; v1 submitted 4 May, 2025;
originally announced May 2025.
-
Unemployment Dynamics Forecasting with Machine Learning Regression Models
Authors:
Kyungsu Kim
Abstract:
In this paper, I explored how a range of regression and machine learning techniques can be applied to monthly U.S. unemployment data to produce timely forecasts. I compared seven models: Linear Regression, SGDRegressor, Random Forest, XGBoost, CatBoost, Support Vector Regression, and an LSTM network, training each on a historical span of data and then evaluating on a later hold-out period. Input f…
▽ More
In this paper, I explored how a range of regression and machine learning techniques can be applied to monthly U.S. unemployment data to produce timely forecasts. I compared seven models: Linear Regression, SGDRegressor, Random Forest, XGBoost, CatBoost, Support Vector Regression, and an LSTM network, training each on a historical span of data and then evaluating on a later hold-out period. Input features include macro indicators (GDP growth, CPI), labor market measures (job openings, initial claims), financial variables (interest rates, equity indices), and consumer sentiment.
I tuned model hyperparameters via cross-validation and assessed performance with standard error metrics and the ability to predict the correct unemployment direction. Across the board, tree-based ensembles (and CatBoost in particular) deliver noticeably better forecasts than simple linear approaches, while the LSTM captures underlying temporal patterns more effectively than other nonlinear methods. SVR and SGDRegressor yield modest gains over standard regression but don't match the consistency of the ensemble and deep-learning models.
Interpretability tools ,feature importance rankings and SHAP values, point to job openings and consumer sentiment as the most influential predictors across all methods. By directly comparing linear, ensemble, and deep-learning approaches on the same dataset, our study shows how modern machine-learning techniques can enhance real-time unemployment forecasting, offering economists and policymakers richer insights into labor market trends.
In the comparative evaluation of the models, I employed a dataset comprising thirty distinct features over the period from January 2020 through December 2024.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.