Search | arXiv e-print repository

Subspace-based Approximate Hessian Method for Zeroth-Order Optimization

Authors: Dongyoon Kim, Sungjae Lee, Wonjin Lee, Kwang In Kim

Abstract: Zeroth-order optimization addresses problems where gradient information is inaccessible or impractical to compute. While most existing methods rely on first-order approximations, incorporating second-order (curvature) information can, in principle, significantly accelerate convergence. However, the high cost of function evaluations required to estimate Hessian matrices often limits practical appli… ▽ More Zeroth-order optimization addresses problems where gradient information is inaccessible or impractical to compute. While most existing methods rely on first-order approximations, incorporating second-order (curvature) information can, in principle, significantly accelerate convergence. However, the high cost of function evaluations required to estimate Hessian matrices often limits practical applicability. We present the subspace-based approximate Hessian (ZO-SAH) method, a zeroth-order optimization algorithm that mitigates these costs by focusing on randomly selected two-dimensional subspaces. Within each subspace, ZO-SAH estimates the Hessian by fitting a quadratic polynomial to the objective function and extracting its second-order coefficients. To further reduce function-query costs, ZO-SAH employs a periodic subspace-switching strategy that reuses function evaluations across optimization steps. Experiments on eight benchmark datasets, including logistic regression and deep neural network training tasks, demonstrate that ZO-SAH achieves significantly faster convergence than existing zeroth-order methods. △ Less

Submitted 8 July, 2025; originally announced July 2025.

Comments: 20 pages, 8 figures

arXiv:2507.05686 [pdf, ps, other]

Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs

Authors: SeungWon Ji, Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee

Abstract: Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt's language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired lang… ▽ More Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt's language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired language generation. Applied to the Qwen model, our method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. This work provides a practical and efficient solution for enhancing the language controllability of LLMs, making them more reliable for global applications. △ Less

Submitted 8 July, 2025; originally announced July 2025.

arXiv:2507.05556 [pdf, ps, other]

Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads

Authors: Jumin Kim, Seungmin Baek, Minbok Wi, Hwayong Nam, Michael Jaemin Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn

Abstract: Per-Row Activation Counting (PRAC), a DRAM read disturbance mitigation method, modifies key DRAM timing parameters, reportedly causing significant performance overheads in simulator-based studies. However, given known discrepancies between simulators and real hardware, real-machine experiments are vital for accurate PRAC performance estimation. We present the first real-machine performance analysi… ▽ More Per-Row Activation Counting (PRAC), a DRAM read disturbance mitigation method, modifies key DRAM timing parameters, reportedly causing significant performance overheads in simulator-based studies. However, given known discrepancies between simulators and real hardware, real-machine experiments are vital for accurate PRAC performance estimation. We present the first real-machine performance analysis of PRAC. After verifying timing modifications on the latest CPUs using microbenchmarks, our analysis shows that PRAC's average and maximum overheads are just 1.06% and 3.28% for the SPEC CPU2017 workloads -- up to 9.15x lower than simulator-based reports. Further, we show that the close page policy minimizes this overhead by effectively hiding the elongated DRAM row precharge operations due to PRAC from the critical path. △ Less

Submitted 7 July, 2025; originally announced July 2025.

Comments: 4 pages, 4 figures, to appear at IEEE Computer Architecture Letters

arXiv:2507.05418 [pdf, ps, other]

Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

Authors: Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, Paul Pu Liang

Abstract: Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermi… ▽ More Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/GeoFact-X_BRIDGE △ Less

Submitted 7 July, 2025; originally announced July 2025.

arXiv:2507.04748 [pdf, ps, other]

LLM-based Question-Answer Framework for Sensor-driven HVAC System Interaction

Authors: Sungmin Lee, Minju Kang, Joonhee Lee, Seungyong Lee, Dongju Kim, Jingi Hong, Jun Shin, Pei Zhang, JeongGil Ko

Abstract: Question-answering (QA) interfaces powered by large language models (LLMs) present a promising direction for improving interactivity with HVAC system insights, particularly for non-expert users. However, enabling accurate, real-time, and context-aware interactions with HVAC systems introduces unique challenges, including the integration of frequently updated sensor data, domain-specific knowledge… ▽ More Question-answering (QA) interfaces powered by large language models (LLMs) present a promising direction for improving interactivity with HVAC system insights, particularly for non-expert users. However, enabling accurate, real-time, and context-aware interactions with HVAC systems introduces unique challenges, including the integration of frequently updated sensor data, domain-specific knowledge grounding, and coherent multi-stage reasoning. In this paper, we present JARVIS, a two-stage LLM-based QA framework tailored for sensor data-driven HVAC system interaction. JARVIS employs an Expert-LLM to translate high-level user queries into structured execution instructions, and an Agent that performs SQL-based data retrieval, statistical processing, and final response generation. To address HVAC-specific challenges, JARVIS integrates (1) an adaptive context injection strategy for efficient HVAC and deployment-specific information integration, (2) a parameterized SQL builder and executor to improve data access reliability, and (3) a bottom-up planning scheme to ensure consistency across multi-stage response generation. We evaluate JARVIS using real-world data collected from a commercial HVAC system and a ground truth QA dataset curated by HVAC experts to demonstrate its effectiveness in delivering accurate and interpretable responses across diverse queries. Results show that JARVIS consistently outperforms baseline and ablation variants in both automated and user-centered assessments, achieving high response quality and accuracy. △ Less

Submitted 7 July, 2025; originally announced July 2025.

arXiv:2507.04660 [pdf, ps, other]

CP-Dilatation: A Copy-and-Paste Augmentation Method for Preserving the Boundary Context Information of Histopathology Images

Authors: Sungrae Hong, Sol Lee, Mun Yong Yi

Abstract: Medical AI diagnosis including histopathology segmentation has derived benefits from the recent development of deep learning technology. However, deep learning itself requires a large amount of training data and the medical image segmentation masking, in particular, requires an extremely high cost due to the shortage of medical specialists. To mitigate this issue, we propose a new data augmentatio… ▽ More Medical AI diagnosis including histopathology segmentation has derived benefits from the recent development of deep learning technology. However, deep learning itself requires a large amount of training data and the medical image segmentation masking, in particular, requires an extremely high cost due to the shortage of medical specialists. To mitigate this issue, we propose a new data augmentation method built upon the conventional Copy and Paste (CP) augmentation technique, called CP-Dilatation, and apply it to histopathology image segmentation. To the well-known traditional CP technique, the proposed method adds a dilation operation that can preserve the boundary context information of the malignancy, which is important in histopathological image diagnosis, as the boundary between the malignancy and its margin is mostly unclear and a significant context exists in the margin. In our experiments using histopathology benchmark datasets, the proposed method was found superior to the other state-of-the-art baselines chosen for comparison. △ Less

Submitted 7 July, 2025; originally announced July 2025.

Comments: 5 pages, 5 figures

arXiv:2507.04388 [pdf, ps, other]

Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers

Authors: Jung-Ho Hong, Ho-Joong Kim, Kyu-Sung Jeon, Seong-Whan Lee

Abstract: The feature attribution method reveals the contribution of input variables to the decision-making process to provide an attribution map for explanation. Existing methods grounded on the information bottleneck principle compute information in a specific layer to obtain attributions, compressing the features by injecting noise via a parametric damping ratio. However, the attribution obtained in a sp… ▽ More The feature attribution method reveals the contribution of input variables to the decision-making process to provide an attribution map for explanation. Existing methods grounded on the information bottleneck principle compute information in a specific layer to obtain attributions, compressing the features by injecting noise via a parametric damping ratio. However, the attribution obtained in a specific layer neglects evidence of the decision-making process distributed across layers. In this paper, we introduce a comprehensive information bottleneck (CoIBA), which discovers the relevant information in each targeted layer to explain the decision-making process. Our core idea is applying information bottleneck in multiple targeted layers to estimate the comprehensive information by sharing a parametric damping ratio across the layers. Leveraging this shared ratio complements the over-compressed information to discover the omitted clues of the decision by sharing the relevant information across the targeted layers. We suggest the variational approach to fairly reflect the relevant information of each layer by upper bounding layer-wise information. Therefore, CoIBA guarantees that the discarded activation is unnecessary in every targeted layer to make a decision. The extensive experimental results demonstrate the enhancement in faithfulness of the feature attributions provided by CoIBA. △ Less

Submitted 6 July, 2025; originally announced July 2025.

Comments: CVPR 2025 (highlight)

arXiv:2507.04018 [pdf, ps, other]

doi 10.1007/978-981-96-8180-8_38

Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning

Authors: Nayeon Kim, Eojin Jeon, Jun-Hyung Park, SangKeun Lee

Abstract: In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, facilitating Korean OOV word representations to capture both text and phoneme i… ▽ More In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, facilitating Korean OOV word representations to capture both text and phoneme information of words. We empirically demonstrate that KOPL significantly improves the performance on Korean Natural Language Processing (NLP) tasks, while being readily integrated into existing static and contextual Korean embedding models in a plug-and-play manner. Notably, we show that KOPL outperforms the state-of-the-art model by an average of 1.9%. Our code is available at https://github.com/jej127/KOPL.git. △ Less

Submitted 5 July, 2025; originally announced July 2025.

Journal ref: Advances in Knowledge Discovery and Data Mining. PAKDD 2025

arXiv:2507.04014 [pdf, ps, other]

Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition

Authors: Kyuhee Kim, Sangah Lee

Abstract: As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs' cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropria… ▽ More As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs' cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs' cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard. △ Less

Submitted 5 July, 2025; originally announced July 2025.

arXiv:2507.03328 [pdf, ps, other]

scikit-package -- software packaging standards and roadmap for sharing reproducible scientific software

Authors: S. Lee, C. Myers, A. Yang, T. Zhang, S. J. L. Billinge

Abstract: Scientific advancement relies on the ability to share and reproduce results. When data analysis or calculations are carried out using software written by scientists there are special challenges around code versions, quality and code sharing. scikit-package provides a roadmap to facilitate code reuse and sharing with minimal effort through tutorials coupled with automated and centralized reusable w… ▽ More Scientific advancement relies on the ability to share and reproduce results. When data analysis or calculations are carried out using software written by scientists there are special challenges around code versions, quality and code sharing. scikit-package provides a roadmap to facilitate code reuse and sharing with minimal effort through tutorials coupled with automated and centralized reusable workflows. The goal of the project is to provide pedagogical and practical tools for scientists who are not professionally trained software engineers to write more reusable and maintainable software code. Code reuse can occur at multiple levels of complexity-from turning a code block into a function within a single script, to publishing a publicly installable, fully tested, and documented software package scikit-package provides a community maintained set of tools, and a roadmap, to help scientists bring their software higher levels of reproducibility and shareability. △ Less

Submitted 8 July, 2025; v1 submitted 4 July, 2025; originally announced July 2025.

Comments: GitHub: https://github.com/scikit-package/scikit-package Doc: https://scikit-package.github.io/scikit-package/

arXiv:2507.03114 [pdf, ps, other]

Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications

Authors: Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan

Abstract: This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of models, distributing them across multiple devices is required. Overlapping strategies, which enable concurrent computation and communication, are critical for… ▽ More This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of models, distributing them across multiple devices is required. Overlapping strategies, which enable concurrent computation and communication, are critical for mitigating communication bottlenecks and maximizing GPU utilization. However, the current consensus is that we should always and aggressively overlap compute and communication to mitigate the overhead of distribution. By systematically evaluating state-of-the-art GPUs, this study investigates the impact of hardware features such as numeric precision, specialized cores, and power capping on distributed training workloads. Comprehensive experiments and studies showcase the effects of overlapping strategies on performance and power consumption across varying scenarios. We observe that overlapping computation and communication can result in an average computational slowdown of 18.9%, with a maximum of 40.0% slowdown. This slowdown is in comparison to the scenario when no communication was happening with the compute. We consider this an ideal execution scenario, where the communication in parallel has not impact on the compute time. However, performing computation and communication sequentially is, on average, 10.2% slower than overlapped execution, with a maximum slowdown of 26.6%. We further observe, while specialized datapath and optimized numeric precision mitigate certain slowdowns, overlapping execution can lead to resource contention and also increase power consumption under specific configurations. The analysis also uncovers trade-offs introduced by power and frequency capping, emphasizing the importance of balanced strategies to optimize energy efficiency and training throughput. △ Less

Submitted 3 July, 2025; originally announced July 2025.

arXiv:2507.02393 [pdf, ps, other]

PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection

Authors: Seokyeong Lee, Sithu Aung, Junyong Choi, Seungryong Kim, Ig-Jae Kim, Junghyun Cho

Abstract: Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we pro… ▽ More Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD. △ Less

Submitted 3 July, 2025; originally announced July 2025.

Comments: 18 pages, 16 figures

arXiv:2507.02080 [pdf, ps, other]

TAGF: Time-aware Gated Fusion for Multimodal Valence-Arousal Estimation

Authors: Yubeen Lee, Sangeun Lee, Chaewon Park, Junyeop Cha, Eunil Park

Abstract: Multimodal emotion recognition often suffers from performance degradation in valence-arousal estimation due to noise and misalignment between audio and visual modalities. To address this challenge, we introduce TAGF, a Time-aware Gated Fusion framework for multimodal emotion recognition. The TAGF adaptively modulates the contribution of recursive attention outputs based on temporal dynamics. Speci… ▽ More Multimodal emotion recognition often suffers from performance degradation in valence-arousal estimation due to noise and misalignment between audio and visual modalities. To address this challenge, we introduce TAGF, a Time-aware Gated Fusion framework for multimodal emotion recognition. The TAGF adaptively modulates the contribution of recursive attention outputs based on temporal dynamics. Specifically, the TAGF incorporates a BiLSTM-based temporal gating mechanism to learn the relative importance of each recursive step and effectively integrates multistep cross-modal features. By embedding temporal awareness into the recursive fusion process, the TAGF effectively captures the sequential evolution of emotional expressions and the complex interplay between modalities. Experimental results on the Aff-Wild2 dataset demonstrate that TAGF achieves competitive performance compared with existing recursive attention-based models. Furthermore, TAGF exhibits strong robustness to cross-modal misalignment and reliably models dynamic emotional transitions in real-world conditions. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: 9 pages, 2 figures, 2 tables

arXiv:2507.02068 [pdf, ps, other]

How do Software Engineering Candidates Prepare for Technical Interviews?

Authors: Brian Bell, Teresa Thomas, Sang Won Lee, Chris Brown

Abstract: To obtain employment, aspiring software engineers must complete technical interviews -- a hiring process which involves candidates writing code while communicating to an audience. However, the complexities of tech interviews are difficult to prepare for and seldom faced in computing curricula. To this end, we seek to understand how candidates prepare for technical interviews, investigating the eff… ▽ More To obtain employment, aspiring software engineers must complete technical interviews -- a hiring process which involves candidates writing code while communicating to an audience. However, the complexities of tech interviews are difficult to prepare for and seldom faced in computing curricula. To this end, we seek to understand how candidates prepare for technical interviews, investigating the effects of preparation methods and the role of education. We distributed a survey to candidates (n = 131) actively preparing for technical interviews. Our results suggest candidates rarely train in authentic settings and courses fail to support preparation efforts -- leading to stress and unpreparedness. Based on our findings, we provide implications for stakeholders to enhance tech interview preparation for candidates pursuing software engineering roles. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.00683 [pdf, ps, other]

Testing the spin-bath view of self-attention: A Hamiltonian analysis of GPT-2 Transformer

Authors: Satadeep Bhattacharjee, Seung-Cheol Lee

Abstract: The recently proposed physics-based framework by Huo and Johnson~\cite{huo2024capturing} models the attention mechanism of Large Language Models (LLMs) as an interacting two-body spin system, offering a first-principles explanation for phenomena like repetition and bias. Building on this hypothesis, we extract the complete Query-Key weight matrices from a production-grade GPT-2 model and derive th… ▽ More The recently proposed physics-based framework by Huo and Johnson~\cite{huo2024capturing} models the attention mechanism of Large Language Models (LLMs) as an interacting two-body spin system, offering a first-principles explanation for phenomena like repetition and bias. Building on this hypothesis, we extract the complete Query-Key weight matrices from a production-grade GPT-2 model and derive the corresponding effective Hamiltonian for every attention head. From these Hamiltonians we obtain analytic \textit{phase boundaries} logit gap criteria that predict which token should dominate the next-token distribution for a given context. A systematic evaluation on 144 heads across 20 factual-recall prompts reveals a strong negative correlation between the theoretical logit gaps and the model's empirical token rankings ($r\approx-0.70$, $p<10^{-3}$).Targeted ablations further show that suppressing the heads most aligned with the spin-bath predictions induces the anticipated shifts in output probabilities, confirming a causal link rather than a coincidental association. Taken together, our findings provide the first strong empirical evidence for the spin-bath analogy in a production-grade model. This validation not only furnishes a tractable, physics-inspired lens for interpretability but also provides the groundwork for novel generative models, bridging the gap between theoretical condensed matter physics and AI. △ Less

Submitted 4 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

arXiv:2506.24119 [pdf, ps, other]

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Authors: Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques

Abstract: Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving ve… ▽ More Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, We implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development. △ Less

Submitted 30 June, 2025; v1 submitted 30 June, 2025; originally announced June 2025.

Comments: Work in Progress

arXiv:2506.23516 [pdf, ps, other]

FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization

Authors: Seung-Wook Kim, Seongyeol Kim, Jiah Kim, Seowon Ji, Se-Ho Lee

Abstract: Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components… ▽ More Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23326 [pdf, ps, other]

Simplifying Data-Driven Modeling of the Volume-Flow-Pressure Relationship in Hydraulic Soft Robotic Actuators

Authors: Sang-Yoep Lee, Leonardo Zamora Yanez, Jacob Rogatinsky, Vi T. Vo, Tanvi Shingade, Tommaso Ranzani

Abstract: Soft robotic systems are known for their flexibility and adaptability, but traditional physics-based models struggle to capture their complex, nonlinear behaviors. This study explores a data-driven approach to modeling the volume-flow-pressure relationship in hydraulic soft actuators, focusing on low-complexity models with high accuracy. We perform regression analysis on a stacked balloon actuator… ▽ More Soft robotic systems are known for their flexibility and adaptability, but traditional physics-based models struggle to capture their complex, nonlinear behaviors. This study explores a data-driven approach to modeling the volume-flow-pressure relationship in hydraulic soft actuators, focusing on low-complexity models with high accuracy. We perform regression analysis on a stacked balloon actuator system using exponential, polynomial, and neural network models with or without autoregressive inputs. The results demonstrate that simpler models, particularly multivariate polynomials, effectively predict pressure dynamics with fewer parameters. This research offers a practical solution for real-time soft robotics applications, balancing model complexity and computational efficiency. Moreover, the approach may benefit various techniques that require explicit analytical models. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2506.22806 [pdf, ps, other]

Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate

Authors: Byung Hyun Lee, Sungjin Lim, Seunggyu Lee, Dong Un Kang, Se Young Chun

Abstract: Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross… ▽ More Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding \emph{linear} modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding \emph{nonlinear} Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent the forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit contents demonstrated that the proposed CPE outperforms prior arts by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at https://github.com/Hyun1A/CPE △ Less

Submitted 28 June, 2025; originally announced June 2025.

arXiv:2506.21855 [pdf, ps, other]

Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Authors: Jiho Choi, Sang Jun Lee

Abstract: In this paper, we propose a method that learns a general representation of periodic signals from unlabeled facial videos by capturing subtle changes in skin tone over time. The proposed framework employs the video masked autoencoder to learn a high-dimensional spatio-temporal representation of the facial region through self-supervised learning. Capturing quasi-periodic signals in the video is cruc… ▽ More In this paper, we propose a method that learns a general representation of periodic signals from unlabeled facial videos by capturing subtle changes in skin tone over time. The proposed framework employs the video masked autoencoder to learn a high-dimensional spatio-temporal representation of the facial region through self-supervised learning. Capturing quasi-periodic signals in the video is crucial for remote photoplethysmography (rPPG) estimation. To account for signal periodicity, we apply frame masking in terms of video sampling, which allows the model to capture resampled quasi-periodic signals during the pre-training stage. Moreover, the framework incorporates physiological bandlimit constraints, leveraging the property that physiological signals are sparse within their frequency bandwidth to provide pulse cues to the model. The pre-trained encoder is then transferred to the rPPG task, where it is used to extract physiological signals from facial videos. We evaluate the proposed method through extensive experiments on the PURE, UBFC-rPPG, MMPD, and V4V datasets. Our results demonstrate significant performance improvements, particularly in challenging cross-dataset evaluations. Our code is available at https://github.com/ziiho08/Periodic-MAE. △ Less

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.21765 [pdf, ps, other]

TUS-REC2024: A Challenge to Reconstruct 3D Freehand Ultrasound Without External Tracker

Authors: Qi Li, Shaheer U. Saeed, Yuliang Huang, Mingyuan Luo, Zhongnuo Yan, Jiongquan Chen, Xin Yang, Dong Ni, Nektarios Winter, Phuc Nguyen, Lucas Steinberger, Caelan Haney, Yuan Zhao, Mingjie Jiang, Bowen Ren, SiYeoul Lee, Seonho Kim, MinKyung Seo, MinWoo Kim, Yimeng Dou, Zhiwei Zhang, Yin Li, Tomy Varghese, Dean C. Barratt, Matthew J. Clarkson , et al. (2 additional authors not shown)

Abstract: Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequence… ▽ More Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequences, and generalisability across scanning protocols. The TUS-REC2024 Challenge was established to benchmark and accelerate progress in trackerless 3D ultrasound reconstruction by providing a publicly available dataset for the first time, along with a baseline model and evaluation framework. The Challenge attracted over 43 registered teams, of which 6 teams submitted 21 valid dockerized solutions. Submitted methods spanned a wide range of algorithmic approaches, including recurrent models, registration-driven volume refinement, attention, and physics-informed models. This paper presents an overview of the Challenge design, summarises the key characteristics of the dataset, provides a concise literature review, introduces the technical details of the underlying methodology working with tracked freehand ultrasound data, and offers a comparative analysis of submitted methods across multiple evaluation metrics. The results highlight both the progress and current limitations of state-of-the-art approaches in this domain, and inform directions for future research. The data, evaluation code, and baseline are publicly available to facilitate ongoing development and reproducibility. As a live and evolving benchmark, this Challenge is designed to be continuously developed and improved. The Challenge was held at MICCAI 2024 and will be organised again at MICCAI 2025, reflecting its growing impact and the sustained commitment to advancing this field. △ Less

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.21582 [pdf, ps, other]

VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Authors: Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyi Liu, Kwan-Liu Ma

Abstract: Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a… ▽ More Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaroration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems. △ Less

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.21039 [pdf, ps, other]

Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

Authors: Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han

Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, they often suffer from subgoal infeasibility and inefficient planning. We introduce Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that enforces singl… ▽ More Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, they often suffer from subgoal infeasibility and inefficient planning. We introduce Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that enforces single-step subgoal reachability by structurally constraining high-level decision-making. To enhance exploration, SSE employs a decoupled exploration policy that systematically traverses underexplored regions of the goal space. Furthermore, a failure-aware path refinement, which refines graph-based planning by dynamically adjusting edge costs according to observed low-level success rates, thereby improving subgoal reliability. Experimental results across diverse long-horizon benchmarks demonstrate that SSE consistently outperforms existing goal-conditioned RL and hierarchical RL approaches in both efficiency and success rate. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: 9 technical page followed by references and appendix

arXiv:2506.20922 [pdf, ps, other]

M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization

Authors: Ju-Hyeon Nam, Dong-Hyun Moon, Sang-Chul Lee

Abstract: Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2… ▽ More Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map, a curvature metric indicating the difficulty of forgery localization, which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: Accepted in International Conference on Computer Vision (ICCV) 2025

arXiv:2506.20702 [pdf]

The Singapore Consensus on Global AI Safety Research Priorities

Authors: Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, Max Tegmark, Lan Xue, Ya-Qin Zhang, Stephen Casper, Wan Sie Lee, Sören Mindermann, Vanessa Wilfred, Vidhisha Balachandran, Fazl Barez, Michael Belinsky, Imane Bello, Malo Bourgon, Mark Brakel, Siméon Campos, Duncan Cass-Beggs, Jiahao Chen, Rumman Chowdhury, Kuan Chua Seah, Jeff Clune, Juntao Dai , et al. (63 additional authors not shown)

Abstract: Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on… ▽ More Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety" aimed to support research in this space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. This resulting report builds on the International AI Safety Report chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this report organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control). △ Less

Submitted 30 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

Comments: Final report from the "2025 Singapore Conference on AI (SCAI)" held April 26: https://www.scai.gov.sg/2025/scai2025-report

arXiv:2506.20357 [pdf, ps, other]

Tabular Feature Discovery With Reasoning Type Exploration

Authors: Sungwon Han, Sungkyu Park, Seungeon Lee

Abstract: Feature engineering for tabular data remains a critical yet challenging step in machine learning. Recently, large language models (LLMs) have been used to automatically generate new features by leveraging their vast knowledge. However, existing LLM-based approaches often produce overly simple or repetitive features, partly due to inherent biases in the transformations the LLM chooses and the lack… ▽ More Feature engineering for tabular data remains a critical yet challenging step in machine learning. Recently, large language models (LLMs) have been used to automatically generate new features by leveraging their vast knowledge. However, existing LLM-based approaches often produce overly simple or repetitive features, partly due to inherent biases in the transformations the LLM chooses and the lack of structured reasoning guidance during generation. In this paper, we propose a novel method REFeat, which guides an LLM to discover diverse and informative features by leveraging multiple types of reasoning to steer the feature generation process. Experiments on 59 benchmark datasets demonstrate that our approach not only achieves higher predictive accuracy on average, but also discovers more diverse and meaningful features. These results highlight the promise of incorporating rich reasoning paradigms and adaptive strategy selection into LLM-driven feature discovery for tabular data. △ Less

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20112 [pdf]

A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection

Authors: Songsoo Kim, Seungtae Lee, See Young Lee, Joonho Kim, Keechan Kan, Dukyong Yoon

Abstract: Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250… ▽ More Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250 each: radiography, ultrasonography, CT, MRI) from the MIMIC-III database. Two external datasets (CheXpert and Open-i) were validation sets. Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Results: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P>=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3's superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: 29 pages, 5 figures, 4 tables. Code available at https://github.com/radssk/mp-rred

ACM Class: I.2.7

arXiv:2506.19602 [pdf, ps, other]

Soft Robotic Delivery of Coiled Anchors for Cardiac Interventions

Authors: Leonardo Zamora Yanez, Jacob Rogatinsky, Dominic Recco, Sang-Yoep Lee, Grace Matthews, Andrew P. Sabelhaus, Tommaso Ranzani

Abstract: Trans-catheter cardiac intervention has become an increasingly available option for high-risk patients without the complications of open heart surgery. However, current catheterbased platforms suffer from a lack of dexterity, force application, and compliance required to perform complex intracardiac procedures. An exemplary task that would significantly ease minimally invasive intracardiac procedu… ▽ More Trans-catheter cardiac intervention has become an increasingly available option for high-risk patients without the complications of open heart surgery. However, current catheterbased platforms suffer from a lack of dexterity, force application, and compliance required to perform complex intracardiac procedures. An exemplary task that would significantly ease minimally invasive intracardiac procedures is the implantation of anchor coils, which can be used to fix and implant various devices in the beating heart. We introduce a robotic platform capable of delivering anchor coils. We develop a kineto-statics model of the robotic platform and demonstrate low positional error. We leverage the passive compliance and high force output of the actuator in a multi-anchor delivery procedure against a motile in-vitro simulator with millimeter level accuracy. △ Less

Submitted 24 June, 2025; originally announced June 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2506.19451 [pdf, ps, other]

Low-Complexity Semantic Packet Aggregation for Token Communication via Lookahead Search

Authors: Seunghun Lee, Jihong Park, Jinho Choi, Hyuncheol Park

Abstract: Tokens are fundamental processing units of generative AI (GenAI) and large language models (LLMs), and token communication (TC) is essential for enabling remote AI-generate content (AIGC) and wireless LLM applications. Unlike traditional bits, each of which is independently treated, the semantics of each token depends on its surrounding context tokens. This inter-token dependency makes TC vulnerab… ▽ More Tokens are fundamental processing units of generative AI (GenAI) and large language models (LLMs), and token communication (TC) is essential for enabling remote AI-generate content (AIGC) and wireless LLM applications. Unlike traditional bits, each of which is independently treated, the semantics of each token depends on its surrounding context tokens. This inter-token dependency makes TC vulnerable to outage channels, where the loss of a single token can significantly distort the original message semantics. Motivated by this, this paper focuses on optimizing token packetization to maximize the average token similarity (ATS) between the original and received token messages under outage channels. Due to inter-token dependency, this token grouping problem is combinatorial, with complexity growing exponentially with message length. To address this, we propose a novel framework of semantic packet aggregation with lookahead search (SemPA-Look), built on two core ideas. First, it introduces the residual semantic score (RSS) as a token-level surrogate for the message-level ATS, allowing robust semantic preservation even when a certain token packet is lost. Second, instead of full search, SemPA-Look applies a lookahead search-inspired algorithm that samples intra-packet token candidates without replacement (fixed depth), conditioned on inter-packet token candidates sampled with replacement (fixed width), thereby achieving linear complexity. Experiments on a remote AIGC task with the MS-COCO dataset (text captioned images) demonstrate that SemPA-Look achieves high ATS and LPIPS scores comparable to exhaustive search, while reducing computational complexity by up to 40$\times$. Compared to other linear-complexity algorithms such as the genetic algorithm (GA), SemPA-Look achieves 10$\times$ lower complexity, demonstrating its practicality for remote AIGC and other TC applications. △ Less

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19417 [pdf, ps, other]

Center of Gravity-Guided Focusing Influence Mechanism for Multi-Agent Reinforcement Learning

Authors: Yisak Park, Sunwoo Lee, Seungyul Han

Abstract: Cooperative multi-agent reinforcement learning (MARL) under sparse rewards presents a fundamental challenge due to limited exploration and insufficient coordinated attention among agents. In this work, we propose the Focusing Influence Mechanism (FIM), a novel framework that enhances cooperation by directing agent influence toward task-critical elements, referred to as Center of Gravity (CoG) stat… ▽ More Cooperative multi-agent reinforcement learning (MARL) under sparse rewards presents a fundamental challenge due to limited exploration and insufficient coordinated attention among agents. In this work, we propose the Focusing Influence Mechanism (FIM), a novel framework that enhances cooperation by directing agent influence toward task-critical elements, referred to as Center of Gravity (CoG) state dimensions, inspired by Clausewitz's military theory. FIM consists of three core components: (1) identifying CoG state dimensions based on their stability under agent behavior, (2) designing counterfactual intrinsic rewards to promote meaningful influence on these dimensions, and (3) encouraging persistent and synchronized focus through eligibility-trace-based credit accumulation. These mechanisms enable agents to induce more targeted and effective state transitions, facilitating robust cooperation even in extremely sparse reward settings. Empirical evaluations across diverse MARL benchmarks demonstrate that the proposed FIM significantly improves cooperative performance compared to baselines. △ Less

Submitted 24 June, 2025; originally announced June 2025.

Comments: 9 technical page followed by references and appendix

arXiv:2506.19217 [pdf, ps, other]

MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports

Authors: Sunggu Kyung, Hyungbin Park, Jinyoung Seo, Jimin Sung, Jihyun Kim, Dongyeong Kim, Wooyoung Jo, Yoojin Nam, Sangah Park, Taehee Kwon, Sang Min Lee, Namkug Kim

Abstract: Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question an… ▽ More Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels: classification, detection, and correction. Using this benchmark, we quantitatively assess the performance of state-of-the-art 3D medical MLLMs, revealing substantial variation in their capabilities across different error types. Our benchmark contributes to the development of more reliable and clinically applicable MLLMs, ultimately helping reduce diagnostic errors and improve accuracy in clinical practice. The code and datasets are available at https://github.com/babbu3682/MedErr-CT. △ Less

Submitted 23 June, 2025; originally announced June 2025.

Comments: 14 pages, 5 figures, submitted to CVPR 2025

arXiv:2506.18557 [pdf, ps, other]

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Authors: Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim

Abstract: Audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual corresponden… ▽ More Audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. Extensive experimental results on MUSIC and VGGSound datasets demonstrate the effectiveness of our approach, significantly outperforming existing methods in both single-source and multi-source localization scenarios. Code and generated detailed contextual information are available at: https://github.com/VisualAIKHU/OA-SSL. △ Less

Submitted 23 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

Comments: Accepted at CVPR 2025

Journal ref: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 8342-8351

arXiv:2506.18284 [pdf]

Open Set Recognition for Endoscopic Image Classification: A Deep Learning Approach on the Kvasir Dataset

Authors: Kasra Moazzami, Seoyoun Son, John Lin, Sun Min Lee, Daniel Son, Hayeon Lee, Jeongho Lee, Seongji Lee

Abstract: Endoscopic image classification plays a pivotal role in medical diagnostics by identifying anatomical landmarks and pathological findings. However, conventional closed-set classification frameworks are inherently limited in open-world clinical settings, where previously unseen conditions can arise andcompromise model reliability. To address this, we explore the application of Open Set Recognition… ▽ More Endoscopic image classification plays a pivotal role in medical diagnostics by identifying anatomical landmarks and pathological findings. However, conventional closed-set classification frameworks are inherently limited in open-world clinical settings, where previously unseen conditions can arise andcompromise model reliability. To address this, we explore the application of Open Set Recognition (OSR) techniques on the Kvasir dataset, a publicly available and diverse endoscopic image collection. In this study, we evaluate and compare the OSR capabilities of several representative deep learning architectures, including ResNet-50, Swin Transformer, and a hybrid ResNet-Transformer model, under both closed-set and open-set conditions. OpenMax is adopted as a baseline OSR method to assess the ability of these models to distinguish known classes from previously unseen categories. This work represents one of the first efforts to apply open set recognition to the Kvasir dataset and provides a foundational benchmark for evaluating OSR performance in medical image analysis. Our results offer practical insights into model behavior in clinically realistic settings and highlight the importance of OSR techniques for the safe deployment of AI systems in endoscopy. △ Less

Submitted 23 June, 2025; originally announced June 2025.

Comments: 9 pages, 3 figures, 3 tables

arXiv:2506.17951 [pdf, ps, other]

A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment

Authors: Quanwei Tang, Sophia Yat Mei Lee, Junshuang Wu, Dong Zhang, Shoushan Li, Erik Cambria, Guodong Zhou

Abstract: Recent advancements in retrieval-augmented generation (RAG) have enhanced large language models in question answering by integrating external knowledge. However, challenges persist in achieving global understanding and aligning responses with human ethical and quality preferences. To address these issues, we propose GraphMPA, a comprehensive graph-based framework with mode-seeking preference align… ▽ More Recent advancements in retrieval-augmented generation (RAG) have enhanced large language models in question answering by integrating external knowledge. However, challenges persist in achieving global understanding and aligning responses with human ethical and quality preferences. To address these issues, we propose GraphMPA, a comprehensive graph-based framework with mode-seeking preference alignment. Our approach constructs a hierarchical document graph using a general similarity measurement, mimicking human cognitive processes for information understanding and synthesis. Additionally, we introduce mode-seeking preference optimization to better align model outputs with human preferences through probability-matching constraints. Extensive experiments on six datasets demonstrate the effectiveness of our \href{https://github.com/tangquanwei/GraphMPA}{GraphMPA}. △ Less

Submitted 22 June, 2025; originally announced June 2025.

Comments: acl 2025 findings

arXiv:2506.16617 [pdf, ps, other]

The Role of Explanation Styles and Perceived Accuracy on Decision Making in Predictive Process Monitoring

Authors: Soobin Chae, Suhwan Lee, Hanna Hauptmann, Hajo A. Reijers, Xixi Lu

Abstract: Predictive Process Monitoring (PPM) often uses deep learning models to predict the future behavior of ongoing processes, such as predicting process outcomes. While these models achieve high accuracy, their lack of interpretability undermines user trust and adoption. Explainable AI (XAI) aims to address this challenge by providing the reasoning behind the predictions. However, current evaluations o… ▽ More Predictive Process Monitoring (PPM) often uses deep learning models to predict the future behavior of ongoing processes, such as predicting process outcomes. While these models achieve high accuracy, their lack of interpretability undermines user trust and adoption. Explainable AI (XAI) aims to address this challenge by providing the reasoning behind the predictions. However, current evaluations of XAI in PPM focus primarily on functional metrics (such as fidelity), overlooking user-centered aspects such as their effect on task performance and decision-making. This study investigates the effects of explanation styles (feature importance, rule-based, and counterfactual) and perceived AI accuracy (low or high) on decision-making in PPM. We conducted a decision-making experiment, where users were presented with the AI predictions, perceived accuracy levels, and explanations of different styles. Users' decisions were measured both before and after receiving explanations, allowing the assessment of objective metrics (Task Performance and Agreement) and subjective metrics (Decision Confidence). Our findings show that perceived accuracy and explanation style have a significant effect. △ Less

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted at CAiSE'25

arXiv:2506.15977 [pdf, ps, other]

Towards Classifying Histopathological Microscope Images as Time Series Data

Authors: Sungrae Hong, Hyeongmin Park, Youngsin Ko, Sol Lee, Bryan Wong, Mun Yong Yi

Abstract: As the frontline data for cancer diagnosis, microscopic pathology images are fundamental for providing patients with rapid and accurate treatment. However, despite their practical value, the deep learning community has largely overlooked their usage. This paper proposes a novel approach to classifying microscopy images as time series data, addressing the unique challenges posed by their manual acq… ▽ More As the frontline data for cancer diagnosis, microscopic pathology images are fundamental for providing patients with rapid and accurate treatment. However, despite their practical value, the deep learning community has largely overlooked their usage. This paper proposes a novel approach to classifying microscopy images as time series data, addressing the unique challenges posed by their manual acquisition and weakly labeled nature. The proposed method fits image sequences of varying lengths to a fixed-length target by leveraging Dynamic Time-series Warping (DTW). Attention-based pooling is employed to predict the class of the case simultaneously. We demonstrate the effectiveness of our approach by comparing performance with various baselines and showcasing the benefits of using various inference strategies in achieving stable and reliable results. Ablation studies further validate the contribution of each component. Our approach contributes to medical image analysis by not only embracing microscopic images but also lifting them to a trustworthy level of performance. △ Less