Search | arXiv e-print repository

No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Authors: Dasol Choi, Woomyoung Park, Youngsook Song

Abstract: Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

Submitted 6 July, 2025; originally announced July 2025.

arXiv:2507.04327

TinyProto: Communication-Efficient Federated Learning with Sparse Prototypes in Resource-Constrained Environments

Authors: Gyuejeong Lee, Daeyoung Choi

Abstract: Communication efficiency in federated learning (FL) remains a critical challenge for resource-constrained environments. While prototype-based FL reduces communication overhead by sharing class prototypes (mean activations in the penultimate layer) instead of model parameters, its efficiency decreases with larger feature dimensions and class counts. We propose TinyProto, which addresses these limitations through Class-wise Prototype Sparsification (CPS) and adaptive prototype scaling. CPS enables structured sparsity by allocating specific dimensions to class prototypes and transmitting only non-zero elements, while adaptive scaling adjusts prototypes based on class distributions. Our experiments show that TinyProto reduces communication costs by up to 4x compared to existing methods while maintaining performance. Beyond its communication efficiency, TinyProto offers crucial advantages: it achieves compression without client-side computational overhead and supports heterogeneous architectures, making it ideal for resource-constrained heterogeneous FL.

Submitted 6 July, 2025; originally announced July 2025.
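The CPS idea (allocate specific dimensions to each class prototype, send only non-zero elements) can be illustrated with a toy sketch. It assumes a hypothetical allocation in which each class owns an equal, contiguous block of feature dimensions. The paper's actual allocation rule and its adaptive scaling step are not reproduced. `allocate_dims`, `sparsify`, and `densify` are illustrative names only.

```python
from typing import Dict, List, Tuple

def allocate_dims(num_classes: int, feature_dim: int) -> Dict[int, range]:
    """Hypothetical allocation: split the feature dimensions into equal,
    disjoint, contiguous blocks, one block per class."""
    block = feature_dim // num_classes
    return {c: range(c * block, (c + 1) * block) for c in range(num_classes)}

def sparsify(prototype: List[float], dims: range) -> List[Tuple[int, float]]:
    """Client side: keep only the class's allocated dimensions and return the
    non-zero entries as (index, value) pairs, the only data sent upstream."""
    return [(i, prototype[i]) for i in dims if prototype[i] != 0.0]

def densify(pairs: List[Tuple[int, float]], feature_dim: int) -> List[float]:
    """Server side: rebuild a dense prototype with zeros everywhere else."""
    dense = [0.0] * feature_dim
    for i, value in pairs:
        dense[i] = value
    return dense

if __name__ == "__main__":
    alloc = allocate_dims(num_classes=4, feature_dim=8)
    proto = [0.5, 0.1, 0.9, -0.3, 0.2, 0.0, 0.7, 0.4]  # toy class-1 mean activation
    pairs = sparsify(proto, alloc[1])
    print(pairs)  # [(2, 0.9), (3, -0.3)]
    print(densify(pairs, 8))
```

Under this toy allocation, a client sends feature_dim / num_classes pairs per class instead of feature_dim floats. The relative saving therefore grows with the class count, which is the regime where the abstract says plain prototype sharing becomes less efficient.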

arXiv:2507.04310

Heterogeneous Federated Learning with Prototype Alignment and Upscaling

Authors: Gyuejeong Lee, Jihwan Shin, Daeyoung Choi

Abstract: Heterogeneity in data distributions and model architectures remains a significant challenge in federated learning (FL). Various heterogeneous FL (HtFL) approaches have recently been proposed to address this challenge. Among them, prototype-based FL (PBFL) has emerged as a practical framework that shares only per-class mean activations from the penultimate layer. However, PBFL approaches often suffer from suboptimal prototype separation, limiting their discriminative power. We propose Prototype Normalization (ProtoNorm), a novel PBFL framework that addresses this limitation through two key components: Prototype Alignment (PA) and Prototype Upscaling (PU). The PA method draws inspiration from the Thomson problem in classical physics, optimizing global prototype configurations on a unit sphere to maximize angular separation; subsequently, the PU method increases prototype magnitudes to enhance separation in Euclidean space. Extensive evaluations on benchmark datasets show that our approach better separates prototypes and thus consistently outperforms existing HtFL approaches. Notably, since ProtoNorm inherits the communication efficiency of PBFL and the PA is performed server-side, it is particularly suitable for resource-constrained environments.

Submitted 6 July, 2025; originally announced July 2025.
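The Thomson-problem analogy can be made concrete with a small sketch. Unit vectors, one per class, are spread over the sphere by gradient descent on the Coulomb energy E = sum_{i<j} 1/||x_i - x_j||. Each step is followed by renormalization, and a uniform rescaling stands in for upscaling. This is only an illustration of the objective named in the abstract. The optimizer, step size, and the single `scale` constant are assumptions, not ProtoNorm's actual procedure.

```python
import math
import random
from typing import List

Vec = List[float]

def normalize(v: Vec) -> Vec:
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def thomson_align(num_classes: int, dim: int, steps: int = 2000,
                  lr: float = 0.05, seed: int = 0) -> List[Vec]:
    """Approximately minimize E = sum_{i<j} 1 / ||x_i - x_j|| over unit
    vectors by gradient descent, re-normalizing after every step."""
    rng = random.Random(seed)
    pts = [normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
           for _ in range(num_classes)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(num_classes)]
        for i in range(num_classes):
            for j in range(num_classes):
                if i == j:
                    continue
                diff = [a - b for a, b in zip(pts[i], pts[j])]
                dist3 = math.sqrt(sum(d * d for d in diff)) ** 3
                # dE/dx_i = -sum_j (x_i - x_j) / ||x_i - x_j||^3
                for k in range(dim):
                    grads[i][k] -= diff[k] / dist3
        pts = [normalize([p - lr * g for p, g in zip(pts[i], grads[i])])
               for i in range(num_classes)]
    return pts

def upscale(prototypes: List[Vec], scale: float) -> List[Vec]:
    """Hypothetical upscaling: one constant factor keeps the angles but
    enlarges the Euclidean distances between class prototypes."""
    return [[scale * x for x in p] for p in prototypes]

if __name__ == "__main__":
    protos = thomson_align(num_classes=4, dim=3)
    cos = [sum(a * b for a, b in zip(protos[i], protos[j]))
           for i in range(4) for j in range(i + 1, 4)]
    # For 4 points in 3-D the optimum is a regular tetrahedron,
    # where every pairwise cosine equals -1/3.
    print([round(c, 3) for c in cos])
```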

arXiv:2507.01308

LANet: A Lane Boundaries-Aware Approach For Robust Trajectory Prediction

Authors: Muhammad Atta ur Rahman, Dooseop Choi, KyoungWook Min

Abstract: Accurate motion forecasting is critical for safe and efficient autonomous driving, enabling vehicles to predict future trajectories and make informed decisions in complex traffic scenarios. Most current motion prediction models rely primarily on lane-centerline representations, which limits their ability to capture critical road environments, traffic rules, and constraints. In this work, we propose an enhanced motion forecasting model informed by multiple vector map elements, including lane boundaries and road edges, that facilitates a richer and more complete representation of driving environments. An effective feature fusion strategy is developed to merge information from different vector map components, where the model learns holistic information on road structures and their interactions with agents. Since encoding more information about the road environment increases memory usage and computational cost, we developed an effective pruning mechanism that filters the most relevant map connections to the target agent, ensuring computational efficiency while maintaining essential spatial and semantic relationships for accurate trajectory prediction. Overcoming the limitations of lane centerline-based models, our method provides a more informative and efficient representation of the driving environment and advances the state of the art for autonomous vehicle motion forecasting.
We verify our approach with extensive experiments on the Argoverse 2 (AV2) motion forecasting dataset, where our method remains competitive while achieving improved performance. Index Terms: Autonomous driving, trajectory prediction, vector map elements, road topology, connection pruning, Argoverse 2.

Submitted 1 July, 2025; originally announced July 2025.

Comments: Accepted at the 17th IEEE International Conference on Advanced Computational Intelligence (ICACI 2025)
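The abstract mentions a pruning mechanism that keeps only the map connections most relevant to the target agent. The sketch below assumes a hypothetical relevance criterion, Euclidean distance with a radius cutoff and a top-k limit. It also simplifies each map element to a single reference point, whereas real vector-map elements are polylines. It is not LANet's actual pruning rule.

```python
import heapq
import math
from typing import List, Tuple

def prune_map_elements(agent_xy: Tuple[float, float],
                       elements: List[Tuple[str, Tuple[float, float]]],
                       k: int, max_dist: float) -> List[str]:
    """Keep at most k map elements (lane boundaries, road edges, ...) within
    max_dist of the agent, nearest first. Distance is a hypothetical stand-in
    for whatever relevance score the actual model uses."""
    candidates = [(math.dist(agent_xy, xy), eid) for eid, xy in elements]
    candidates = [c for c in candidates if c[0] <= max_dist]
    return [eid for _, eid in heapq.nsmallest(k, candidates)]

if __name__ == "__main__":
    elements = [("lane_b", (3.0, 4.0)), ("edge_1", (1.0, 0.0)),
                ("lane_a", (0.0, 2.0)), ("far", (100.0, 0.0))]
    print(prune_map_elements((0.0, 0.0), elements, k=2, max_dist=10.0))
```

Only the surviving elements would then enter the feature-fusion stage. That keeps the cost of richer map inputs bounded by k rather than by the size of the map.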

arXiv:2506.13921

A Study on Effective Initial Guess Finding Method Based on Bézier Curves: Orbit Determination Applications

Authors: Daegyun Choi, Sungwook Yang, Henzeh Leeghim, Donghoon Kim

Abstract: In celestial mechanics, proper orbits related to missions are obtained by solving two-point boundary value problems. Since the choice of initial value affects the convergence of the solution, an effective method for finding an initial guess is required. In this work, Bézier curves, which can describe complicated curves and surfaces, are utilized to find the initial guess. First, the given problems are transformed into Bézier-curve form, and the control points of the Bézier curves, which govern the shape of the curves, are selected by solving a system of nonlinear equations. Finally, the initial guess is obtained by substituting the calculated control points into the Bézier curves. To validate the performance of the proposed method, numerical simulations are conducted for three kinds of orbits, ranging from circular to highly elliptical orbits (HEO). The proposed method is compared to the general shooting method. The comparison results show that the initial guess calculated by Bézier curves makes finding the solution more efficient in terms of computational time and iterations. They also show that the proposed method finds the solution for the HEO, whereas the general shooting method fails to find one.

Submitted 16 June, 2025; originally announced June 2025.

Comments: 10 pages, 4 figures, 4 tables, 2019 AAS/AIAA Astrodynamics Specialist Conference
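The final step described in the abstract is substituting the computed control points into the Bézier curve to obtain the initial guess. It can be sketched with the Bernstein form B(t) = sum_i C(n, i) (1 - t)^(n - i) t^i P_i. The control points below are made up for illustration. The paper obtains them by solving nonlinear equations derived from the orbit problem, which is not shown. One convenient property is that a Bézier curve always passes through its first and last control points, so the two boundary points can be fixed directly.

```python
from math import comb
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def bezier_point(control: Sequence[Point], t: float) -> Point:
    """Evaluate B(t) = sum_i C(n, i) (1 - t)^(n - i) t^i P_i for t in [0, 1]."""
    n = len(control) - 1
    out = [0.0] * len(control[0])
    for i, p in enumerate(control):
        weight = comb(n, i) * (1.0 - t) ** (n - i) * t ** i  # Bernstein basis
        for k, coord in enumerate(p):
            out[k] += weight * coord
    return tuple(out)

def initial_guess(control: Sequence[Point], num_nodes: int) -> List[Point]:
    """Sample the curve at evenly spaced parameter values, e.g. to seed a
    discretized boundary value solver (hypothetical usage)."""
    return [bezier_point(control, j / (num_nodes - 1)) for j in range(num_nodes)]

if __name__ == "__main__":
    ctrl = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]  # toy quadratic curve
    print(bezier_point(ctrl, 0.5))  # (1.0, 1.0)
    print(initial_guess(ctrl, 5))
```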

arXiv:2506.11344

Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models

Authors: Peilin Wu, Jinho D. Choi

Abstract: We present a novel approach to Speaker Diarization (SD) by leveraging text-based methods focused on Sentence-level Speaker Change Detection within dialogues. Unlike audio-based SD systems, which are often challenged by audio quality and speaker similarity, our approach utilizes the dialogue transcript alone. Two models are developed: the Single Prediction Model (SPM) and the Multiple Prediction Model (MPM), both of which demonstrate significant improvements in identifying speaker changes, particularly in short conversations. Our findings, based on a curated dataset encompassing diverse conversational scenarios, reveal that the text-based SD approach, especially the MPM, performs competitively against state-of-the-art audio-based SD systems, with superior performance in short conversational contexts. This paper not only showcases the potential of leveraging linguistic features for SD but also highlights the importance of integrating semantic understanding into SD systems, opening avenues for future research in multimodal and semantic feature-based diarization.

Submitted 12 June, 2025; originally announced June 2025.
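Sentence-level speaker change detection yields one "new speaker starts here" decision per sentence. A small post-processing sketch turns those decisions into speaker turns. The change flags are assumed to come from some detector such as SPM or MPM, whose internals are not shown. The optional two-party labeling is an extra assumption, because change detection alone cannot tell whether a new turn belongs to a previously seen speaker.

```python
from typing import List, Tuple

def turns_from_changes(sentences: List[str],
                       change_flags: List[bool]) -> List[List[str]]:
    """Group consecutive sentences into speaker turns. change_flags[i] is True
    when sentence i is predicted to start a new speaker (flag 0 is ignored)."""
    turns: List[List[str]] = []
    for i, (sentence, changed) in enumerate(zip(sentences, change_flags)):
        if i == 0 or changed:
            turns.append([sentence])
        else:
            turns[-1].append(sentence)
    return turns

def label_two_party(turns: List[List[str]]) -> List[Tuple[str, str]]:
    """Hypothetical labeling that assumes exactly two alternating speakers."""
    return [(f"SPK{k % 2}", " ".join(t)) for k, t in enumerate(turns)]

if __name__ == "__main__":
    sents = ["Hi there.", "How are you?", "Fine, thanks.", "Good to hear."]
    flags = [False, False, True, True]
    for speaker, text in label_two_party(turns_from_changes(sents, flags)):
        print(speaker, text)
```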

arXiv:2506.03575

Brillouin lasers in Bragg grating microresonators

Authors: Ryan L. Russell, Moritz Merklein, Choon Kong Lai, Cong Tinh Bui, Alvaro Casas-Bedoya, Duk-Yong Choi, Stephen J. Madden, Benjamin J. Eggleton

Abstract: Chip-scale coherent light sources are required in applications ranging from metrology and sensing to telecommunications. Brillouin lasers (BLs) offer a route to ultra-coherent optical sources in compact microresonators with free spectral range (FSR) matched to the Brillouin frequency shift (BFS). However, BFS-FSR matching typically facilitates cascaded Brillouin scattering, constraining achievable BL output power and coherence. Here, we demonstrate inhibition of cascading in a planar-integrated chalcogenide microresonator by exploiting the photonic bandgap (PBG) associated with a post-fabrication inscribed, reconfigurable intracavity Bragg grating. The PBG inhibits energy transfer within the target Brillouin scattering pathway, such as from the pump to the first-order Stokes wave. As a quantitative measure of Brillouin scattering inhibition, we report an at least six-fold increase in the threshold for the onset of BL oscillation, which is ultimately limited by thermorefraction. For an on-chip pump power of 399 mW, sufficient for a tenth-order Brillouin cascade, complete inhibition was achieved. Our work positions Bragg grating microresonators as an enabling platform for high-performance on-chip BL sources with reconfigurable modes of operation.

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2506.01360

RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases

Authors: Dongwon Choi, Sunwoo Kim, Juyeon Kim, Kyungho Kim, Geon Lee, Shinhwan Kang, Myunghwan Kim, Kijung Shin

Abstract: Relational databases (RDBs) are composed of interconnected tables, where relationships between them are defined through foreign keys. Recent research on applying machine learning to RDBs has explored graph-based representations of RDBs, where rows of tables are modeled as nodes, and foreign key relationships are modeled as edges. RDB-to-graph modeling helps capture cross-table dependencies, ultimately leading to enhanced performance across diverse tasks. However, there are numerous ways to model RDBs as graphs, and performance varies significantly depending on the chosen graph model. In our analysis, applying a common heuristic rule for graph modeling leads to up to a 10% drop in performance compared to the best-performing graph model, which remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 9 automatic RDB-to-graph modeling methods on the 12 tasks over 600x faster than on-the-fly evaluation, which requires repeated model training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling.

Submitted 2 June, 2025; originally announced June 2025.

Comments: Code and datasets are available at https://github.com/chlehdwon/RDB2G-Bench
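The baseline modeling described in the abstract treats every row as a node and every foreign-key reference as an edge. It can be sketched in a few lines. The table layout used here (`{"pk": ..., "rows": [...]}` plus `(child_table, fk_column, parent_table)` triples) is a hypothetical format chosen for illustration, not RDB2G-Bench's API. The alternative graph models that the benchmark compares against this default are not shown.

```python
from typing import Any, Dict, List, Tuple

Node = Tuple[str, Any]  # (table name, primary-key value)

def rdb_to_graph(tables: Dict[str, Dict[str, Any]],
                 foreign_keys: List[Tuple[str, str, str]]
                 ) -> Tuple[List[Node], List[Tuple[Node, Node]]]:
    """Every row becomes a node; every non-null foreign-key value becomes an
    edge from the referencing row to the referenced row."""
    nodes = [(name, row[spec["pk"]])
             for name, spec in tables.items() for row in spec["rows"]]
    edges = []
    for child, fk_col, parent in foreign_keys:
        pk = tables[child]["pk"]
        for row in tables[child]["rows"]:
            if row.get(fk_col) is not None:
                edges.append(((child, row[pk]), (parent, row[fk_col])))
    return nodes, edges

if __name__ == "__main__":
    tables = {
        "users": {"pk": "id", "rows": [{"id": 1}, {"id": 2}]},
        "orders": {"pk": "oid", "rows": [{"oid": 10, "user_id": 1},
                                          {"oid": 11, "user_id": 2},
                                          {"oid": 12, "user_id": None}]},
    }
    nodes, edges = rdb_to_graph(tables, [("orders", "user_id", "users")])
    print(len(nodes), edges)
```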

arXiv:2506.01206

Mamba Drafters for Speculative Decoding

Authors: Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

Abstract: Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.

Submitted 1 June, 2025; originally announced June 2025.
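"Maintaining alignment with the target model's distribution" refers to the standard speculative sampling verification rule, sketched below. Any drafter, Mamba-based or otherwise, plugs into this rule. A draft token x sampled from the drafter's distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target's distribution. On rejection, a replacement is drawn from the normalized residual max(0, p - q). This is the generic rule from the speculative decoding literature, not the paper's tree search. Toy dictionaries stand in for model outputs.

```python
import random
from typing import Dict, List

Dist = Dict[str, float]  # token -> probability

def sample(dist: Dist, rng: random.Random) -> str:
    """Inverse-CDF sampling from a small categorical distribution."""
    u, acc = rng.random(), 0.0
    for token, prob in dist.items():
        acc += prob
        if u < acc:
            return token
    return list(dist)[-1]  # guard against floating-point round-off

def residual(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q), the distribution used after a rejection."""
    tokens = list(p) + [t for t in q if t not in p]
    r = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens}
    z = sum(r.values())
    return {t: v / z for t, v in r.items()}

def verify(draft: List[str], q_dists: List[Dist], p_dists: List[Dist],
           rng: random.Random) -> List[str]:
    """draft[i] was sampled from q_dists[i]; p_dists holds the target model's
    distributions for the same positions plus one extra position.
    Returns the accepted prefix followed by a corrective or bonus token."""
    out: List[str] = []
    for x, q, p in zip(draft, q_dists, p_dists):
        if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):
            out.append(x)  # accepted with probability min(1, p(x)/q(x))
        else:
            # Rejection implies p(x) < q(x), so p - q has positive mass elsewhere.
            out.append(sample(residual(p, q), rng))
            return out
    out.append(sample(p_dists[len(draft)], rng))  # all accepted: bonus token
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    same = {"a": 0.5, "b": 0.5}
    print(verify(["a", "b"], [same, same], [same, same, same], rng))
```

The speed-up comes from the target model scoring all drafted positions in one pass. The acceptance test guarantees that the emitted tokens follow the target distribution regardless of drafter quality.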

arXiv:2506.00481

PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings

Authors: Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, Yohan Jo

Abstract: Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.

Submitted 31 May, 2025; originally announced June 2025.

Comments: ACL 2025 Main. Code and dataset are released at: https://github.com/holi-lab/PVP_Personalized_Visual_Persuasion

arXiv:2505.16348

Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

Authors: Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo

Abstract: Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process that quantifies the impact of memory utilization on task performance. This process enables the evaluation of agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns).
Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: https://connoriginal.github.io/MEMENTO

Submitted 22 May, 2025; originally announced May 2025.

Comments: Work in progress

arXiv:2505.16267

Rapid adiabatic couplers with arbitrary split ratios for broadband DWDM interleaver application

Authors: Daehan Choi, Woo-Joo Kim, Young-Ik Sohn

Abstract: We experimentally demonstrate compact, broadband rapid adiabatic couplers (RACs) with arbitrary power split ratios, achieved through the combination of translational offset and waveguide width control. Fabricated RACs with four different target split ratios show power splitting within ±3% of the design target over a 160 nm wavelength range. Using these RACs, we implement an 8-channel dense wavelength division multiplexing (DWDM) interleaver exhibiting crosstalk below -20 dB for the center 8 channels with flat-top passbands. Over a broader wavelength range, the design maintains crosstalk below -10 dB across more than 40 channels with 100 GHz spacing, demonstrating the broadband capability and scalability of RAC-based photonic integrated circuits.

Submitted 22 May, 2025; originally announced May 2025.

Comments: 4 pages, 4 figures

arXiv:2505.15685

From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

Authors: Xiuchao Sui, Daiying Tian, Qi Sun, Ruirui Chen, Dongkyu Choi, Kenneth Kwok, Soujanya Poria

Abstract: Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (LLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.

Submitted 21 May, 2025; originally announced May 2025.

Comments: 17 pages, 13 figures

arXiv:2505.15367

Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Authors: Dasol Choi, Seunghyun Lee, Youngsook Song

Abstract: Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical everyday life scenarios remains insufficiently explored. We introduce VERI (Visual Emergency Recognition Dataset), a diagnostic benchmark comprising 200 images organized into 100 contrastive pairs. Each emergency scene is paired with a visually similar but safe counterpart through human verification and refinement. Using a two-stage evaluation protocol - risk identification and emergency response - we assess 14 VLMs (2B to 124B parameters) across medical emergencies, accidents, and natural disasters. Our analysis reveals an "overreaction problem", where models accurately identify genuine emergencies (70-100 percent success rate) but produce high false-positive rates, misclassifying 31-96 percent of safe situations as dangerous. Ten safe scenarios were universally misclassified by all models regardless of scale. This "better-safe-than-sorry" bias primarily results from contextual overinterpretation (88-93 percent of errors), challenging VLM reliability in safety-critical applications. These findings highlight fundamental limitations in current VLM architectures, which persist despite increased model scale. Our results demonstrate an urgent need for strategies specifically improving contextual reasoning in ambiguous visual situations. The consistently low performance of the models indicates that the dataset serves effectively as a diagnostic benchmark.

Submitted 6 July, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.15277

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Authors: Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo

Abstract: Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, we propose Web-Shepherd, the first process reward model (PRM) for web navigation, which assesses web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that Web-Shepherd achieves about 30 points better accuracy than GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite with GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10x lower cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.

Submitted 21 May, 2025; originally announced May 2025.

Comments: Work in progress

arXiv:2505.15160

Lossless Token Merging Even Without Fine-Tuning in Vision Transformers

Authors: Jaeyeon Lee, Dong-Wan Choi

Abstract: Although Vision Transformers (ViTs) have become the standard architecture in computer vision, their massive sizes lead to significant computational overhead. Token compression techniques have attracted considerable attention to address this issue, but they often suffer from severe information loss, requiring extensive additional training to achieve practical performance. In this paper, we propose Adaptive Token Merging (ATM), a novel method that ensures lossless token merging, eliminating the need for fine-tuning while maintaining competitive performance. ATM adaptively reduces tokens across layers and batches by carefully adjusting layer-specific similarity thresholds, thereby preventing the undesirable merging of less similar tokens with respect to each layer. Furthermore, ATM introduces a novel token matching technique that considers not only similarity but also merging sizes, particularly for the final layers, to minimize the information loss incurred from each merging operation. We empirically validate our method across a wide range of pretrained models, demonstrating that ATM not only outperforms all existing training-free methods but also surpasses most training-intensive approaches, even without additional training. Remarkably, training-free ATM achieves over a 30% reduction in FLOPs for the DeiT-T and DeiT-S models without any drop in their original accuracy.

Submitted 21 May, 2025; originally announced May 2025.

Comments: Under Review
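
A minimal sketch of threshold-controlled token merging, to make the ATM description above concrete. It is not ATM itself: the greedy most-similar-pair loop, the cosine similarity, the size-weighted averaging, and the threshold value are illustrative assumptions, and the abstract does not say how ATM sets its layer-specific thresholds or how merge sizes enter its matching rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, sizes, threshold):
    """Greedily merge the most similar token pair while its similarity is at
    least `threshold` (a hypothetical per-layer value, not ATM's rule).

    Merged tokens are size-weighted averages; `sizes` tracks how many
    original tokens each surviving token represents."""
    tokens = [list(t) for t in tokens]
    sizes = list(sizes)
    while len(tokens) > 1:
        best, pair = -2.0, None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                sim = cosine(tokens[i], tokens[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break  # no remaining pair is similar enough at this layer
        i, j = pair
        wi, wj = sizes[i], sizes[j]
        merged = [(wi * a + wj * b) / (wi + wj) for a, b in zip(tokens[i], tokens[j])]
        del tokens[j], sizes[j]  # j > i, so index i stays valid
        tokens[i], sizes[i] = merged, wi + wj
    return tokens, sizes


# Usage: only the two nearly parallel tokens are similar enough to merge.
merged_tokens, merged_sizes = merge_tokens(
    [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]], [1, 1, 1], threshold=0.95
)
print(merged_tokens, merged_sizes)
```

Raising the threshold merges fewer tokens and lowering it merges more; the size list lets later merges weight a token that already stands for several originals more heavily.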

arXiv:2505.12686 [pdf, other]

RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations

Authors: Seungmin Kim, Sohee Park, Donghyun Kim, Jisu Lee, Daeseon Choi

Abstract: With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals, reconstructing them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat. In extensive experiments, RoVo increased the Defense Success Rate (DSR) by over 70% compared to unprotected speech across four state-of-the-art speech synthesis models. Specifically, RoVo achieved a DSR of 99.5% on a commercial speaker-verification API, effectively neutralizing speech synthesis attacks. Moreover, RoVo's perturbations remained robust even under strong speech enhancement conditions, outperforming traditional methods. A user study confirmed that RoVo preserves both the naturalness and usability of protected speech, highlighting its effectiveness in complex and evolving threat scenarios.

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.10079 [pdf, other]

Electron spin resonance with scanning tunneling microscopy: a tool for an on-surface quantum platform of identical qubits

Authors: Deung-Jang Choi, Soo-hyon Phark, Andreas J. Heinrich, Nicolás Lorente

Abstract: Integration of electron spin resonance (ESR) into a scanning tunneling microscope (STM) has enabled all-electrical control of atomic and molecular spins on solid surfaces with atomic-scale precision and energy resolution beyond thermal limitations. Further, coherent manipulation and detection of individual spins in an ESR-STM establishes a powerful quantum platform, allowing fundamental quantum logic operations to be implemented on identical on-surface qubits. In this review, we introduce recent advances in ESR-STM, focusing on its application to atomic-scale qubits and its extension to molecular qubit systems. We discuss the principles underlying ESR-STM, followed by single-spin addressability, coherent control via Rabi oscillations, and quantum state readout through frequency-resolved detection. We further present multi-qubit control architectures enabled by atom manipulation and local magnetic field engineering, culminating in the realization of multi-qubit logic gates such as the Controlled-NOT and Toffoli gates. These implementations highlight the particular suitability of ESR-STM for atomic-scale quantum circuits. Indeed, ESR-STM can be an excellent tool to perform and evaluate quantum operations in molecular qubits. The results reviewed in this collection establish ESR-STM as a versatile tool for advancing quantum coherent science at the atomic and molecular level in solid-state environments.

Submitted 15 May, 2025; originally announced May 2025.
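
For the ESR-STM review above, coherent control via Rabi oscillations can be illustrated with the textbook Rabi formula for an ideal driven two-level spin, P(t) = Ω²/(Ω² + Δ²) · sin²(√(Ω² + Δ²) · t / 2), where Ω is the Rabi frequency and Δ the detuning. This sketch assumes no decoherence or relaxation, which real ESR-STM measurements do include, and is not taken from the review.

```python
import math


def rabi_flip_probability(omega, delta, t):
    """Spin-flip probability of an ideal two-level system driven for time t.

    omega: Rabi frequency (rad/s); delta: detuning from resonance (rad/s).
    Decoherence and relaxation are ignored in this sketch."""
    omega_eff = math.sqrt(omega ** 2 + delta ** 2)  # generalized Rabi frequency
    if omega_eff == 0.0:
        return 0.0  # no drive and no detuning: the spin never flips
    return (omega ** 2 / omega_eff ** 2) * math.sin(omega_eff * t / 2) ** 2


# A resonant pi pulse (t = pi / omega) flips the spin completely.
print(rabi_flip_probability(2.0, 0.0, math.pi / 2.0))
```

Detuning lowers the maximum flip probability to Ω²/(Ω² + Δ²), which is one reason frequency-resolved driving can address individual spins selectively.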

arXiv:2505.09428 [pdf, ps, other]

Unraveling spin entanglement using quantum gates with scanning tunneling microscopy-driven electron spin resonance

Authors: Eric D. Switzer, Jose Reina-Gálvez, Géza Giedke, Talat S. Rahman, Christoph Wolf, Deung-Jang Choi, Nicolás Lorente

Abstract: Quantum entanglement is a fundamental resource for quantum information processing, and its controlled generation and detection remain key challenges in scalable quantum architectures. Here, we numerically demonstrate the deterministic generation of entangled spin states in a solid-state platform by implementing quantum gates via electron spin resonance combined with scanning tunneling microscopy (ESR-STM). Using two titanium atoms on a MgO/Ag(100) substrate as a model, we construct a two-qubit system whose dynamics are coherently manipulated through tailored microwave pulse sequences. We generate Bell states by implementing a Hadamard gate followed by a controlled-NOT gate, and evaluate their fidelity and concurrence using TimeESR, a code based on the quantum master equation. Our results demonstrate that ESR-STM can create entangled states with significant fidelity. This study paves the way for the realization of atom-based quantum circuits and highlights ESR-STM as a powerful tool for probing and engineering entangled states on surfaces.

Submitted 14 May, 2025; originally announced May 2025.
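
The gate sequence in the entry above (a Hadamard gate followed by a controlled-NOT) can be checked on ideal two-qubit state vectors. This is a noise-free textbook sketch, not the TimeESR master-equation simulation; it uses the standard pure-state concurrence 2|ad − bc| for a state a|00⟩ + b|01⟩ + c|10⟩ + d|11⟩.

```python
import math


def hadamard_on_first(state):
    """Apply H to qubit 1 of a two-qubit state [a00, a01, a10, a11]."""
    a00, a01, a10, a11 = state
    h = 1 / math.sqrt(2)
    return [h * (a00 + a10), h * (a01 + a11), h * (a00 - a10), h * (a01 - a11)]


def cnot(state):
    """CNOT with qubit 1 as control: swaps the |10> and |11> amplitudes."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]


def concurrence(state):
    """Concurrence of a pure two-qubit state (0 = product state, 1 = maximal)."""
    a, b, c, d = state
    return 2 * abs(a * d - b * c)


def fidelity(state, target):
    """|<target|state>|^2 for pure states."""
    overlap = sum(complex(t).conjugate() * amp for t, amp in zip(target, state))
    return abs(overlap) ** 2


# |00> --H--> (|00> + |10>)/sqrt(2) --CNOT--> (|00> + |11>)/sqrt(2)
bell = cnot(hadamard_on_first([1.0, 0.0, 0.0, 0.0]))
phi_plus = [1 / math.sqrt(2), 0.0, 0.0, 1 / math.sqrt(2)]
print(concurrence(bell), fidelity(bell, phi_plus))
```

In a realistic ESR-STM simulation, decoherence during the microwave pulses would reduce both quantities below 1; the ideal values here only serve as the reference point.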

arXiv:2505.08835 [pdf, other]

Robustness Analysis against Adversarial Patch Attacks in Fully Unmanned Stores

Authors: Hyunsik Na, Wonho Lee, Seungdeok Roh, Sohee Park, Daeseon Choi

Abstract: The advent of convenient and efficient fully unmanned stores equipped with artificial intelligence-based automated checkout systems marks a new era in retail. However, these systems have inherent artificial intelligence security vulnerabilities, which can be exploited via adversarial patch attacks, particularly in physical environments. This study demonstrates that adversarial patches can severely disrupt object detection models used in unmanned stores, leading to issues such as theft, inventory discrepancies, and interference. We investigate three types of adversarial patch attacks (Hiding, Creating, and Altering attacks) and highlight their effectiveness. We also introduce a novel color histogram similarity loss function that leverages attacker knowledge of the color information of a target class object. Besides the traditional confusion-matrix-based attack success rate, we introduce a new bounding-box-based metric to analyze the practical impact of these attacks. Starting with attacks on object detection models trained on snack and fruit datasets in a digital environment, we evaluate the effectiveness of adversarial patches in a physical testbed that mimics a real unmanned store with RGB cameras and realistic conditions. Furthermore, we assess the robustness of these attacks in black-box scenarios, demonstrating that shadow attacks can increase attack success rates even without direct access to model parameters. Our study underscores the necessity for robust defense strategies to protect unmanned stores from adversarial threats. Highlighting the limitations of current defense mechanisms in real-time detection systems and discussing various proactive measures, we provide insights into improving the robustness of object detection models and fortifying unmanned retail environments against these attacks.

Submitted 13 May, 2025; originally announced May 2025.

arXiv:2505.02268 [pdf]

doi 10.1016/j.apples.2025.100218

Phantom Domain Finite Element Method: A novel approach for heterogeneous materials

Authors: Tianlong He, Philippe Karamian-Surville, Daniel Choï

Abstract: In this paper, we introduce the Phantom Domain Finite Element Method (PDFEM), a novel computational approach tailored for the efficient analysis of heterogeneous and composite materials. Inspired by fictitious domain methods, this method employs a structured mesh to discretize the entire material domain while utilizing separate, independent meshes for the inclusions. These inclusion meshes are coupled to the structured mesh via a substitution matrix, enabling them to act as phantom meshes that do not directly contribute to the final system of equations. This framework offers significant advantages, including enhanced flexibility in handling complex inclusion geometries and improved computational efficiency. To assess the accuracy and robustness of the proposed method, numerical experiments are conducted on structures containing inclusions of various geometries. To emphasize the efficiency of PDFEM, a numerical simulation highlights its advantages for long natural fibers, such as flax and linen. These simulations are compared against FEM calculations, demonstrating the efficiency of PDFEM. Indeed, meshing such fine structures requires an extremely high number of elements, and in some cases, meshing becomes particularly challenging due to the complexity of the geometries.

Submitted 4 May, 2025; originally announced May 2025.
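
One common way to realize the kind of substitution-matrix coupling described in the PDFEM entry is to express the inclusion's nodal displacements through the structured mesh's shape functions, u_inc = T u_s, and add the projected stiffness Tᵀ K_inc T, so that inclusion degrees of freedom never enter the assembled system. The 1D bar below is a minimal sketch under that assumption; the abstract does not give PDFEM's actual construction of the matrix, and how the overlapped matrix material is treated is not modeled (the inclusion's stiffness is simply added).

```python
def bar_stiffness(E, A, L):
    """Stiffness matrix of a 2-node linear bar element."""
    k = E * A / L
    return [[k, -k], [-k, k]]


def hat_values(nodes, x):
    """Values at x of the linear (hat) shape functions of a 1D structured mesh."""
    row = [0.0] * len(nodes)
    for e in range(len(nodes) - 1):
        x0, x1 = nodes[e], nodes[e + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            row[e], row[e + 1] = 1.0 - t, t
            return row
    raise ValueError("point lies outside the structured mesh")


def assemble(nodes, E_m, inclusion, E_f, A=1.0):
    """Global stiffness on the structured mesh only (assumed coupling scheme)."""
    n = len(nodes)
    K = [[0.0] * n for _ in range(n)]
    # Background (structured) mesh elements.
    for e in range(n - 1):
        ke = bar_stiffness(E_m, A, nodes[e + 1] - nodes[e])
        for a in range(2):
            for b in range(2):
                K[e + a][e + b] += ke[a][b]
    # Phantom inclusion element: its two nodes are substituted via T (2 x n).
    xa, xb = inclusion
    k_inc = bar_stiffness(E_f, A, xb - xa)
    T = [hat_values(nodes, xa), hat_values(nodes, xb)]
    for p in range(n):
        for q in range(n):
            K[p][q] += sum(
                T[a][p] * k_inc[a][b] * T[b][q] for a in range(2) for b in range(2)
            )
    return K


# Hypothetical example: stiff fiber from x = 0.5 to 1.5 inside a soft bar on [0, 2].
K = assemble([0.0, 1.0, 2.0], E_m=1.0, inclusion=(0.5, 1.5), E_f=10.0)
for row in K:
    print(row)
```

Because each row of T sums to one, rigid translations still produce zero force, and the system size stays equal to the number of structured-mesh nodes regardless of how finely the inclusion is meshed.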

arXiv:2505.01015 [pdf, ps, other]

Value Portrait: Assessing Language Models' Values through Psychometrically and Ecologically Valid Items

Authors: Jongwook Han, Dongmin Choi, Woojung Song, Eun-Ju Lee, Yohan Jo

Abstract: Benchmarks for assessing the values of language models have become increasingly important due to the growing need for more authentic, human-aligned responses. However, existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases. Furthermore, the tested scenarios often diverge from the real-world contexts in which models are commonly used to generate text and express values. To address these issues, we propose the Value Portrait benchmark, a reliable framework for evaluating LLMs' value orientations with two key characteristics. First, the benchmark consists of items that capture real-life user-LLM interactions, enhancing the relevance of assessment results to real-world LLM usage. Second, each item is rated by human subjects based on its similarity to their own thoughts, and correlations between these ratings and the subjects' actual value scores are derived. This psychometrically validated approach ensures that items strongly correlated with specific values serve as reliable items for assessing those values. Through evaluating 44 LLMs with our benchmark, we find that these models prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement values. Our analysis also reveals biases in how LLMs perceive various demographic groups, deviating from real human data.

Submitted 11 June, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

Comments: This paper has been accepted for publication at ACL 2025

ACM Class: I.2.7
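
The item-validation idea in the Value Portrait entry (correlating each item's human similarity ratings with the same subjects' value scores) can be sketched with a Pearson correlation. The choice of Pearson's r, the 0.3 cutoff, and the toy ratings are all hypothetical; the abstract does not state which statistic or threshold the benchmark uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_items(item_ratings, value_scores, min_r=0.3):
    """Keep items whose ratings correlate with a value score at r >= min_r.

    item_ratings: {item_id: [rating by subject 1, subject 2, ...]}
    value_scores: [score of subject 1, subject 2, ...] for one value
    min_r: hypothetical cutoff."""
    selected = {}
    for item, ratings in item_ratings.items():
        r = pearson(ratings, value_scores)
        if r >= min_r:
            selected[item] = r
    return selected


# Toy data for five subjects and one value (e.g., Benevolence).
benevolence = [1, 2, 3, 4, 5]
items = {"a": [2, 3, 4, 5, 6], "b": [5, 4, 3, 2, 1], "c": [3, 3, 3, 3, 3]}
kept = select_items(items, benevolence)
print(kept)
```

Item "a" tracks the value score and is kept, while "b" (negatively correlated) and "c" (no variance) are filtered out.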

arXiv:2505.00043 [pdf, ps, other]

EchoNet-Quality: Denoising Echocardiograms via Deep Generative Modeling of Ultrasound Noise

Authors: David Choi, Milos Vukadinovic, Bryan He, Christina Binder, Yuki Sahashi, David Ouyang

Abstract: Echocardiography (echo), or cardiac ultrasound, is the most widely used imaging modality for cardiac form and function due to its relatively low cost, rapid acquisition time, and non-invasive nature. However, ultrasound acquisitions are often limited by artifacts and noise that hinder diagnostic interpretation in clinical settings. Existing methodologies for denoising echos consist solely of traditional filtering-based algorithms or deep learning methods developed on radio-frequency (RF) signals, which limits their clinical applicability and scalability. To address these limitations, we introduce the first deep generative model, developed on B-mode data, that is capable of simulating ultrasound noise. Using this generative model, we develop a synthetic dataset of paired clean and noisy echo images to train a downstream model for real-world image denoising and demonstrate state-of-the-art performance in both internal and external experiments. On both held-out test sets, our method produces echo images with higher gCNR than both the noisy images and images derived from a comparable method, consistent with the provided visual comparisons. Our experiments showcase the potential of our method for future clinical use to improve the quality of echo acquisitions. To encourage further research in the field, we release our source code and model weights at https://github.com/echonet/image_quality.

Submitted 19 June, 2025; v1 submitted 29 April, 2025; originally announced May 2025.
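
The gCNR metric used in the EchoNet-Quality entry is commonly defined as one minus the overlap between the intensity distributions of two regions, for example a cardiac chamber and the surrounding tissue: gCNR = 1 − Σ min(p_in, p_out). The histogram-based sketch below assumes shared bins and a bin count of 32; it is not the authors' evaluation code, and region selection is left to the caller.

```python
def gcnr(inside, outside, bins=32):
    """Generalized contrast-to-noise ratio of two pixel-intensity samples.

    0.0 means the two intensity distributions are indistinguishable;
    1.0 means their histograms do not overlap at all."""
    lo = min(min(inside), min(outside))
    hi = max(max(inside), max(outside))
    width = (hi - lo) / bins or 1.0  # avoid division by zero for constant input

    def hist(values):
        counts = [0] * bins
        for v in values:
            k = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
            counts[k] += 1
        return [c / len(values) for c in counts]

    p, q = hist(inside), hist(outside)
    overlap = sum(min(a, b) for a, b in zip(p, q))
    return 1.0 - overlap


# Well-separated regions versus identical regions.
print(gcnr([0.0, 0.1, 0.2], [0.9, 1.0]))
print(gcnr([0.1, 0.5], [0.1, 0.5]))
```

Unlike the classic contrast-to-noise ratio, this definition is bounded in [0, 1] and does not change under monotonic intensity transformations, which makes it a reasonable choice when comparing images produced by different processing pipelines.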

arXiv:2504.21851 [pdf, other]

TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments

Authors: Sichang Tu, Abigail Powers, Stephen Doogan, Jinho D. Choi

Abstract: Objectives: While Large Language Models (LLMs) have been widely used to assist clinicians and support patients, no existing work has explored dialogue systems for standard diagnostic interviews and assessments. This study aims to bridge the gap in mental healthcare accessibility by developing an LLM-powered dialogue system that replicates clinician behavior. Materials and Methods: We introduce TRUST, a framework of cooperative LLM modules capable of conducting formal diagnostic interviews and assessments for Post-Traumatic Stress Disorder (PTSD). To guide the generation of appropriate clinical responses, we propose a Dialogue Acts schema specifically designed for clinical interviews. Additionally, we develop a patient simulation approach based on real-life interview transcripts to replace time-consuming and costly manual testing by clinicians. Results: A comprehensive set of evaluation metrics is designed to assess the dialogue system from both the agent and patient simulation perspectives. Expert evaluations by conversation and clinical specialists show that TRUST performs comparably to real-life clinical interviews. Discussion: Our system performs at the level of average clinicians, with room for future enhancements in communication styles and response appropriateness. Conclusions: Our TRUST framework shows its potential to facilitate mental healthcare availability.

Submitted 30 April, 2025; originally announced April 2025.

Comments: 5 figures, 4 tables

arXiv:2504.20566 [pdf, ps, other]

Inclusive Training Separation and Implicit Knowledge Interaction for Balanced Online Class-Incremental Learning

Authors: Shunjie Wen, Thomas Heinis, Dong-Wan Choi

Abstract: Online class-incremental learning (OCIL) focuses on gradually learning new classes (called plasticity) from a stream of data in a single pass, while concurrently preserving knowledge of previously learned classes (called stability). The primary challenge in OCIL lies in maintaining a good balance between the knowledge of old and new classes within the continually updated model. Most existing methods rely on explicit knowledge interaction through experience replay, and often employ exclusive training separation to address bias problems. Nevertheless, achieving a well-balanced learner remains a major challenge, as these methods often exhibit either reduced plasticity or limited stability due to difficulties in continually integrating knowledge in the OCIL setting. In this paper, we propose a novel replay-based method, called Balanced Online Incremental Learning (BOIL), which can achieve both high plasticity and stability, thus ensuring more balanced performance in OCIL. BOIL adopts an inclusive training separation strategy using dual classifiers so that knowledge from both old and new classes can be effectively integrated into the model, while introducing implicit approaches for transferring knowledge across the two classifiers. Extensive experimental evaluations over three widely used OCIL benchmark datasets demonstrate the superiority of BOIL, showing better and more balanced performance compared to state-of-the-art replay-based OCIL methods.

Submitted 29 April, 2025; originally announced April 2025.

Comments: Under review

arXiv:2504.18474 [pdf, other]

Generative Induction of Dialogue Task Schemas with Streaming Refinement and Simulated Interactions

Authors: James D. Finch, Yasasvi Josyula, Jinho D. Choi

Abstract: In task-oriented dialogue (TOD) systems, Slot Schema Induction (SSI) is essential for automatically identifying key information slots from dialogue data without manual intervention. This paper presents a novel state-of-the-art (SoTA) approach that formulates SSI as a text generation task, where a language model incrementally constructs and refines a slot schema over a stream of dialogue data. To develop this approach, we present a fully automatic LLM-based TOD simulation method that creates data with high-quality state labels for novel task domains. Furthermore, we identify issues in SSI evaluation due to data leakage and poor metric alignment with human judgment. We resolve these by creating new evaluation data using our simulation method with human guidance and correction, as well as designing improved evaluation metrics. These contributions establish a foundation for future SSI research and advance the SoTA in dialogue understanding and system development.

Submitted 25 April, 2025; originally announced April 2025.

Comments: Accepted (B) to TACL 2025

arXiv:2504.13969 [pdf, other]

Tinker Tales: Interactive Storytelling Framework for Early Childhood Narrative Development and AI Literacy

Authors: Nayoung Choi, Peace Cyebukayire, Jinho D. Choi

Abstract: This paper presents Tinker Tales, an interactive storytelling framework in the format of a board game, designed to support both narrative development and AI literacy in early childhood. The framework integrates tangible and speech-based interactions with AI through NFC chip-attached pawns and tokens, along with a speaker and microphone. Children select and define key story elements (such as characters, places, items, and emotions) using the pawns and tokens, providing further details to the AI and receiving appropriate assistance, similar to how adults prompt AI for specific tasks (e.g., writing). For evaluation, several game sessions were simulated with a child AI agent, and the quality and safety of the generated stories were assessed from various perspectives. This work highlights the potential of combining physical and digital elements in AI literacy, offering a safe and engaging way for children to learn how to effectively collaborate with AI.

Submitted 22 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.13439 [pdf, ps, other]

D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model

Authors: Grace Byun, Jinho D. Choi

Abstract: Evaluating generative models with open-ended generation is challenging due to inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates this issue, but generating high-quality distractors is time-consuming and labor-intensive. We introduce D-GEN, the first open-source distractor generator model that transforms open-ended data into an MC format. To evaluate distractor quality, we propose two novel methods: (1) ranking alignment, ensuring generated distractors retain the discriminatory power of ground-truth distractors, and (2) entropy analysis, comparing model confidence distributions. Our results show that D-GEN preserves ranking consistency (Spearman's rho 0.99, Kendall's tau 0.94) and closely matches the entropy distribution of ground-truth distractors. Human evaluation further confirms the fluency, coherence, distractiveness, and incorrectness of the generated distractors. Our work advances robust and efficient distractor generation with automated evaluation, setting a new standard for MC evaluation.

Submitted 12 June, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

Comments: ACL 2025 Findings

Journal ref: ACL 2025 Findings
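
The two checks in the D-GEN entry can be illustrated as follows: rank several models by accuracy on MC items built with ground-truth distractors and with generated distractors, then compare the two rankings with Spearman's rho and Kendall's tau; entropy is the Shannon entropy of a model's probability distribution over answer options. The accuracy values are made up, ties are not handled, and this is a sketch rather than D-GEN's evaluation code.

```python
import math


def ranks(scores):
    """Rank position of each entry (0 = highest score); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for pos, i in enumerate(order):
        r[i] = pos
    return r


def spearman_rho(x, y):
    """Spearman's rho via the squared rank-difference formula (no ties)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def entropy(probs):
    """Shannon entropy (nats) of a model's distribution over answer options."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical accuracies of four models under each distractor source.
acc_gt = [0.82, 0.75, 0.61, 0.55]
acc_gen = [0.80, 0.70, 0.50, 0.63]  # models 3 and 4 swap places
print(spearman_rho(acc_gt, acc_gen), kendall_tau(acc_gt, acc_gen))
print(entropy([0.25, 0.25, 0.25, 0.25]))
```

Values near 1 for both coefficients mean the generated distractors order the models the same way the ground-truth distractors do, which is the sense of "discriminatory power" in the abstract.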

arXiv:2504.13425 [pdf, other]

Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

Authors: Grace Byun, Shinsun Lee, Nayoung Choi, Jinho D. Choi

Abstract: Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG, achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.

Submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.12870 [pdf, other]

CST-former: Multidimensional Attention-based Transformer for Sound Event Localization and Detection in Real Scenes

Authors: Yusun Shul, Dayun Choi, Jung-Woo Choi

Abstract: Sound event localization and detection (SELD) is the task of classifying sound events and identifying their direction of arrival (DoA) from multichannel acoustic signals. For effective classification and localization, a channel-spectro-temporal transformer (CST-former) was previously proposed. CST-former employs multidimensional attention mechanisms across the spatial, spectral, and temporal domains to enlarge the model's capacity to learn the domain information essential for event detection and DoA estimation over time. In this work, we present an enhanced version of CST-former with multiscale unfolded local embedding (MSULE), developed to capture and aggregate domain information over multiple time-frequency scales. We also propose finetuning and post-processing techniques beneficial for conducting the SELD task over limited training datasets. In-depth ablation studies of the proposed architecture and detailed analysis of the proposed modules are carried out to validate the efficacy of multidimensional attention on the SELD task. Empirical validation through experimentation on the STARSS22 and STARSS23 datasets demonstrates the remarkable performance of CST-former and post-processing techniques without using external data.

Submitted 17 April, 2025; originally announced April 2025.

Comments: 12 pages, 10 figures, Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2504.02877 [pdf, other]

Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations

Authors: DongHyun Choi, Lucas Spangher, Chris Hidey, Peter Grabowski, Ramy Eskander

Abstract: Transformer-based Large Language Models, which suffer from high computational costs, advance so quickly that techniques proposed to streamline earlier iterations are not guaranteed to benefit more modern models. Building upon the Funnel Transformer proposed by Dai and Le (2020), which progressively compresses intermediate representations, we investigate the impact of funneling in contemporary Gemma2 Transformer architectures. We systematically evaluate various funnel configurations and recovery methods, comparing: (1) standard pretraining to funnel-aware pretraining strategies, (2) the impact of funnel-aware fine-tuning, and (3) the type of sequence recovery operation. Our results demonstrate that funneling creates information bottlenecks that propagate through deeper network layers, particularly in larger models (e.g., Gemma 7B), at times leading to unmanageable performance loss. However, carefully selecting the funneling layer and employing effective recovery strategies can substantially mitigate performance losses, achieving up to a 44% reduction in latency. Our findings highlight key trade-offs between computational efficiency and model accuracy, providing practical guidance for deploying funnel-based approaches in large-scale natural language applications.

Submitted 1 April, 2025; originally announced April 2025.
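
To make the funneling bottleneck in the entry above concrete, here is a toy sketch on a list of hidden vectors: strided mean pooling halves the sequence length, and a repeat-upsampling step with an optional residual from the pre-funnel representation restores it. The stride, the mean pooling, and the repeat-plus-residual recovery are assumptions loosely modeled on the original Funnel Transformer; the abstract does not specify which recovery operations the study compares.

```python
def funnel_pool(seq, stride=2):
    """Compress a sequence of vectors by mean-pooling non-overlapping windows."""
    pooled = []
    for start in range(0, len(seq), stride):
        window = seq[start:start + stride]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled


def recover(pooled, original_len, stride=2, residual=None):
    """Restore the original length by repeating each pooled vector `stride`
    times, optionally adding a residual from the pre-funnel sequence."""
    up = [list(vec) for vec in pooled for _ in range(stride)][:original_len]
    if residual is not None:
        up = [[a + b for a, b in zip(u, r)] for u, r in zip(up, residual)]
    return up


hidden = [[1.0], [3.0], [5.0]]               # three positions, 1-dim hidden states
short = funnel_pool(hidden)                  # later layers see only two positions
plain = recover(short, len(hidden))          # repeat-upsampling only
with_skip = recover(short, len(hidden), residual=hidden)
print(short, plain, with_skip)
```

Without the residual, positions that shared a pooling window become identical after recovery, which is one simple way to see how information lost at the funnel layer can propagate to later layers.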

arXiv:2503.22968 [pdf, ps, other]

Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models

Authors: Hanwool Lee, Dasol Choi, Sooyong Kim, Ilgyun Jung, Sangwon Baek, Guijin Son, Inseon Hwang, Naeun Lee, Seunghyeok Hong

Abstract: Recent advancements in Korean large language models (LLMs) have driven numerous benchmarks and evaluation methods, yet inconsistent protocols cause performance gaps of up to 10 percentage points across institutions. Overcoming these reproducibility gaps does not mean enforcing a one-size-fits-all evaluation. Rather, effective benchmarking requires diverse experimental approaches and a framework robust enough to support them. To this end, we introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean LLM assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation, with language consistency enforcement to ensure genuine Korean outputs. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends, ensuring the toolkit adapts to evolving research needs. Beyond standard accuracy metrics, HRET incorporates two Korean-focused output analyses that provide diagnostic insights into language-specific behaviors: a morphology-aware Type-Token Ratio (TTR) for evaluating lexical diversity, and systematic keyword-omission detection for identifying missing concepts. These targeted analyses help researchers pinpoint morphological and semantic shortcomings in model outputs, guiding focused improvements in Korean LLM development.

Submitted 29 June, 2025; v1 submitted 29 March, 2025; originally announced March 2025.
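The two diagnostic analyses described above can be illustrated with a minimal sketch. It assumes the model output has already been split into morphemes by some Korean morphological analyzer (not shown; the segmentation below is written by hand and is hypothetical), and it treats keyword omission as simple substring absence. The function names are illustrative only and are not HRET's API.

```python
from typing import Iterable, List


def type_token_ratio(morphemes: List[str]) -> float:
    """Ratio of distinct morphemes (types) to total morphemes (tokens)."""
    if not morphemes:
        return 0.0
    return len(set(morphemes)) / len(morphemes)


def omitted_keywords(output: str, keywords: Iterable[str]) -> List[str]:
    """Return the expected keywords that never appear in the model output."""
    return [kw for kw in keywords if kw not in output]


# Hypothetical morpheme segmentation of "나는 학교에 갔고 친구는 학교에 없었다".
morphemes = ["나", "는", "학교", "에", "가", "았", "고",
             "친구", "는", "학교", "에", "없", "었", "다"]
print(round(type_token_ratio(morphemes), 3))           # 0.786 (11 types / 14 tokens)
print(omitted_keywords("나는 학교에 갔다", ["학교", "친구"]))  # ['친구']
```

Counting over morphemes rather than whitespace-separated words matters for an agglutinative language: a word-level count would treat forms such as 학교에 and 학교가 as unrelated types, which would likely overstate lexical diversity.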

arXiv:2503.22194 [pdf, other]

ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation

Authors: Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, Minhyuk Sung

Abstract: We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. Previous work on spatial grounding in image generation has mainly focused on 2D positioning and lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise and requires just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.

Submitted 28 May, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

Comments: Project Page: https://origen2025.github.io
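The contrast between plain gradient ascent and Langevin-style sampling can be sketched on a toy one-dimensional reward. This is a generic unadjusted Langevin update with an assumed step size and a hypothetical quadratic reward; it is not ORIGEN's sampler, which operates on the latent of a one-step flow model and adds adaptive time rescaling.

```python
import numpy as np


def reward_grad(x: np.ndarray) -> np.ndarray:
    """Gradient of a toy reward R(x) = -(x - 2)**2 / 2, which peaks at x = 2."""
    return -(x - 2.0)


def gradient_ascent_step(x: np.ndarray, step: float) -> np.ndarray:
    return x + step * reward_grad(x)


def langevin_step(x: np.ndarray, step: float, rng: np.random.Generator) -> np.ndarray:
    # Same update as gradient ascent plus Gaussian noise scaled by sqrt(2 * step):
    # the one extra term that turns optimization into sampling from ~exp(R(x)).
    return x + step * reward_grad(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)


rng = np.random.default_rng(0)
x_ga = np.zeros(1000)  # 1000 independent chains, all starting at 0
x_ld = np.zeros(1000)
for _ in range(500):
    x_ga = gradient_ascent_step(x_ga, 0.05)
    x_ld = langevin_step(x_ld, 0.05, rng)
print(x_ga.std(), x_ld.mean(), x_ld.std())
```

Gradient ascent collapses every chain onto the single maximizer, while the Langevin chains stay spread around the high-reward region (here with a standard deviation close to 1). Keeping samples near the data distribution rather than forcing them to one optimum is one plausible intuition for why the noise helps preserve realism.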

arXiv:2503.13936 [pdf, other]

Time-domain identification of distinct mechanisms for competing charge density waves in a rare-earth tritelluride

Authors: Yifan Su, B. Q. Lv, Alfred Zong, Aaron Müller, Sambuddha Chattopadhyay, Pavel E. Dolgirev, Anisha G. Singh, Joshua A. W. Straquadine, Dongsung Choi, Doron Azoury, Masataka Mogi, Ian R. Fisher, Eugene Demler, Nuh Gedik

Abstract: Understanding the origin of phase transitions and the interactions between distinct phases remains a central task in condensed matter physics. Charge density wave (CDW) systems provide an ideal platform for investigating these phenomena. While the dominant CDW phases in many materials can be explained through Fermi surface nesting or electron-phonon interactions, certain CDW phase transitions remain poorly understood, challenging conventional paradigms. One notable example is the rare-earth tritelluride ErTe3, which hosts two competing CDW orders. While the dominant CDW phase fits within the electron-phonon coupling framework, the formation mechanism of the subdominant CDW remains enigmatic. In this study, we combine time- and angle-resolved photoemission spectroscopy (trARPES) with time-dependent Ginzburg-Landau (TDGL) theory to establish a time-domain approach for probing phase transitions in solid-state systems. By analyzing the distinct recovery dynamics of the two CDW orders in ErTe3 following light excitation, we reveal a novel nucleation-like growth mechanism that likely drives the secondary CDW phase transition. This work not only uncovers a previously unknown CDW formation mechanism in rare-earth tritellurides but also introduces a non-equilibrium framework for understanding phase transitions and phase competition in quantum materials.

Submitted 25 March, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

arXiv:2503.00791 [pdf, other]

doi 10.1145/3706599.3720189

Expandora: Broadening Design Exploration with Text-to-Image Model

Authors: DaEun Choi, Kihoon Son, Hyunjoon Jung, Juho Kim

Abstract: Broad exploration of references is critical in the visual design process. While text-to-image (T2I) models offer efficient and customizable exploration, they often provide limited support for divergent exploration. We conducted a formative study (N=6) to investigate the limitations of current interactions with T2I models for broad exploration and found that designers struggle to articulate exploratory intentions and to manage iterative, non-linear workflows. To address these challenges, we developed Expandora. Users can specify their exploratory intentions and desired diversity levels through structured input, and Expandora uses an LLM-based pipeline to generate tailored prompt variations. The results are displayed in a mindmap-like interface that encourages non-linear workflows. A user study (N=8) demonstrated that Expandora significantly increases prompt diversity, the number of prompts users tried within a given time, and user satisfaction compared to the baseline. Nonetheless, its limitations in supporting convergent thinking suggest opportunities for holistically improving creative processes.

Submitted 2 March, 2025; originally announced March 2025.

Comments: Accepted to CHI'25 LBW

arXiv:2502.21270 [pdf, ps, other]

Conformal Block Divisors for Discrete Series Virasoro VOA $\text{Vir}_{2k+1,2}$

Authors: Daebeom Choi

Abstract: In this work, we study a family of vector bundles on the moduli space of curves constructed from representations of $\text{Vir}_{2k+1,2}$, a family of vertex operator algebras derived from the Virasoro Lie algebra. Using the relationship between rank and degree, we characterize their asymptotic behavior, demonstrating that their first Chern classes are nef on $\overline{\rm{M}}_{g,n}$ in many cases. This is the first nontrivial example of divisors arising from vertex operator algebras that are uniformly positive for all genera. Furthermore, for $g = 1$, these divisors form a $\mathbb{Q}$-basis of the Picard group of $\overline{\rm{M}}_{1,n}$, with several desirable functorial properties. Using this basis, we characterize line bundles on certain contractions of $\overline{\rm{M}}_{1,n}$. We also propose conjectures regarding the conformal blocks of Virasoro VOAs and potential generalizations. In particular, by introducing a generalization of conformal block divisors, we provide a nonlinear nef interpolation between affine and Virasoro conformal block divisors.

Submitted 28 February, 2025; originally announced February 2025.

Comments: 37 pages. Comments Welcome!

MSC Class: 14H10; 17B69 (primary); 81R10 (secondary)

arXiv:2502.19703 [pdf, ps, other]

Singularities and syzygies of secant varieties of smooth projective varieties

Authors: Doyoung Choi, Justin Lacini, Jinhyung Park, John Sheridan

Abstract: We study the higher secant varieties of a smooth projective variety embedded in projective space. We prove that when the variety is a surface and the embedding line bundle is sufficiently positive, these varieties are normal with Du Bois singularities and the syzygies of their defining ideals are linear to the expected order. We show that the cohomology of the structure sheaf of the surface completely determines whether the singularities of its secant varieties are Cohen–Macaulay or rational. We also prove analogous results when the dimension of the original variety is higher and the secant order is low; by contrast, we prove a result that strongly implies these statements do not generalize to higher-dimensional varieties when the secant order is high. Finally, we deduce a complementary result characterizing the ideal of secant varieties of a surface in terms of the symbolic powers of the ideal of the surface itself, and we include a theorem concerning the weight-one syzygies of an embedded surface (analogous to the gonality conjecture for curves), which we discovered as a natural application of our techniques.

Submitted 26 February, 2025; originally announced February 2025.

Comments: 77 pages. Comments welcome

arXiv:2502.19596 [pdf]

Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents

Authors: Nayoung Choi, Grace Byun, Andrew Chung, Ellie S. Paek, Shinsun Lee, Jinho D. Choi

Abstract: Proprietary corporate documents contain rich domain-specific knowledge, but their overwhelming volume and disorganized structure make it difficult even for employees to access the right information when needed. For example, in the automotive industry, vehicle crash-collision tests, each costing hundreds of thousands of dollars, produce highly detailed documentation. However, retrieving relevant content during decision-making remains time-consuming due to the scale and complexity of the material. While Retrieval-Augmented Generation (RAG)-based Question Answering (QA) systems offer a promising solution, building an internal RAG-QA system poses several challenges: (1) handling heterogeneous multi-modal data sources, (2) preserving data confidentiality, and (3) enabling traceability between each piece of information in the generated answer and its original source document. To address these challenges, we propose a RAG-QA framework for internal enterprise use, consisting of (1) a data pipeline that converts raw multi-modal documents into a structured corpus and QA pairs, (2) a fully on-premise, privacy-preserving architecture, and (3) a lightweight reference matcher that links answer segments to supporting content. Applied to the automotive domain, our system improves factual correctness (+1.79, +1.94), informativeness (+1.33, +1.16), and helpfulness (+1.08, +1.67) over a non-RAG baseline, based on 1-5 scale ratings from both human and LLM judges.

Submitted 16 June, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

MSC Class: H.3
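The abstract does not specify how the lightweight reference matcher works. As a purely hypothetical illustration of the idea of linking answer segments to source content, the sketch below assigns each segment to the source chunk with the highest token-level Jaccard overlap and abstains below a threshold. The function names, the threshold, the scoring rule, and the example chunk ids are all assumptions, not the authors' design.

```python
import re
from typing import Dict, List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens (Unicode-aware \\w)."""
    return set(re.findall(r"\w+", text.lower()))


def match_references(segments: List[str], chunks: Dict[str, str],
                     threshold: float = 0.2) -> List[Optional[str]]:
    """For each answer segment, return the id of the best-overlapping chunk, or None."""
    chunk_tokens = {cid: tokens(text) for cid, text in chunks.items()}
    links: List[Optional[str]] = []
    for seg in segments:
        seg_tok = tokens(seg)
        best_id, best_score = None, 0.0
        for cid, ctok in chunk_tokens.items():
            union = seg_tok | ctok
            score = len(seg_tok & ctok) / len(union) if union else 0.0
            if score > best_score:
                best_id, best_score = cid, score
        # Abstain rather than attach a weak, possibly misleading reference.
        links.append(best_id if best_score >= threshold else None)
    return links


chunks = {
    "doc1#p3": "The frontal crash test was run at 56 km/h.",
    "doc2#p1": "Side impact results showed intrusion of 12 cm.",
}
segments = ["The frontal test ran at 56 km/h.", "Cost figures are unavailable."]
print(match_references(segments, chunks))  # ['doc1#p3', None]
```

A lexical matcher like this needs no external service, which fits an on-premise setting, but it would miss paraphrases; an embedding-based similarity is a common alternative.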

arXiv:2502.14800 [pdf]

Discovery of transient topological crystalline order in optically driven SnSe

Authors: Masataka Mogi, Dongsung Choi, Kyoung Hun Oh, Diana Golovanova, Yufei Zhao, Yifan Su, Zongqi Shen, Doron Azoury, Haoyu Xia, Batyr Ilyas, Tianchuang Luo, Noriaki Kida, Taito Osaka, Tadashi Togashi, Binghai Yan, Nuh Gedik

Abstract: Ultrafast optical excitation provides a powerful route for accessing emergent quantum phases far from equilibrium, enabling transient light-induced phenomena such as magnetism, ferroelectricity, and superconductivity. However, extending this approach to induce topological phases, especially in conventional semiconductors, remains challenging. Here, we report the observation of a thermally inaccessible, transient topological crystalline order in the layered semiconductor SnSe, a trivial insulator with a sizable (~0.8 eV) band gap, induced by femtosecond above-gap excitation. Time- and angle-resolved photoemission spectroscopy directly reveals the sub-picosecond emergence of Dirac-like linear dispersions within the band gap. Their features, including a high Fermi velocity (~2.5×10^5 m/s), multiple Dirac points away from high-symmetry momenta, and independence from probe photon energy, are consistent with mirror-symmetry-protected surface states of a topological crystalline insulator. The observed spectral dynamics, combined with density functional theory calculations, indicate that the femtosecond excitation transiently increases lattice symmetry, enabling topological crystalline order to emerge. Our discovery opens new avenues for ultrafast optical control of topological quantum phases in semiconductors, with potential applications in quantum and spintronic devices.

Submitted 16 May, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

Comments: 27 pages, 5 figures

arXiv:2502.08861 [pdf, other]

Two-dimensional Si spin qubit arrays with multilevel interconnects

Authors: Sieu D. Ha, Edwin Acuna, Kate Raach, Zachery T. Bloom, Teresa L. Brecht, James M. Chappell, Maxwell D. Choi, Justin E. Christensen, Ian T. Counts, Dominic Daprano, J. P. Dodson, Kevin Eng, David J. Fialkow, Christina A. C. Garcia, Wonill Ha, Thomas R. B. Harris, Nathan Holman, Isaac Khalaf, Justine W. Matten, Christi A. Peterson, Clifford E. Plesha, Matthew J. Ruiz, Aaron Smith, Bryan J. Thomas, Samuel J. Whiteley, et al. (4 additional authors not shown)

Abstract: The promise of quantum computation is contingent upon physical qubits with both low gate error rate and broad scalability. Silicon-based spins are a leading qubit platform, but demonstrations to date have not utilized fabrication processes capable of extending arrays in two dimensions while maintaining complete control of individual spins. Here, we implement an interconnect process, common in semiconductor manufacturing, with multiple back-end-of-line layers to show an extendable two-dimensional array of spins with fully controllable nearest-neighbor exchange interactions. In a device using three interconnect layers, we encode exchange-only qubits and achieve average single-qubit gate fidelities consistent with single-layer devices, including fidelities greater than 99.9%, as measured by blind randomized benchmarking. Moreover, with spin connectivity in two dimensions, we show that both linear and right-angle exchange-only qubits with high performance can be formed, enabling qubit array reconfigurability in the presence of defects. This extendable device platform demonstrates that industrial manufacturing techniques can be leveraged for scalable spin qubit technologies.

Submitted 12 February, 2025; originally announced February 2025.
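The fidelities quoted above are obtained by randomized benchmarking (RB). The sketch below shows only the textbook single-qubit RB analysis that such variants build on: survival probability follows P(m) = A p^m + B for sequence length m, and the average gate fidelity is F = 1 - (1 - p)(d - 1)/d with d = 2. It assumes B is known and fits log(P - B) linearly; the data are synthetic, and the blind RB protocol used for exchange-only qubits involves steps not shown here.

```python
import numpy as np


def fit_rb_decay(lengths, survival, b):
    """Fit P(m) = A * p**m + B with B known, via a linear fit of log(P - B) against m."""
    slope, intercept = np.polyfit(lengths, np.log(np.asarray(survival) - b), 1)
    return np.exp(slope), np.exp(intercept)  # (p, A)


def average_fidelity(p, d=2):
    """Average gate fidelity from the RB decay parameter: F = 1 - (1 - p)(d - 1)/d."""
    return 1.0 - (1.0 - p) * (d - 1) / d


# Synthetic, noiseless data for illustration only: p = 0.998, A = 0.5, B = 0.5.
lengths = np.array([1, 50, 100, 200, 400, 800])
survival = 0.5 * 0.998 ** lengths + 0.5
p, a = fit_rb_decay(lengths, survival, b=0.5)
print(f"p = {p:.4f}, F = {average_fidelity(p):.4%}")
```

A decay parameter of 0.998 corresponds to an average fidelity of 99.9%, showing how small per-gate differences in p map onto the fidelity scale used in the abstract. Real data would be noisy, and B is usually fitted rather than assumed.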

arXiv:2502.08474 [pdf, other]

Training-Free Restoration of Pruned Neural Networks

Authors: Keonho Lee, Minsoo Kim, Dong-Wan Choi

Abstract: Although network pruning has become a popular way to compress deep neural networks, the resulting accuracy depends heavily on a fine-tuning process that is often computationally expensive and requires the original data. Such fine-tuning may not be feasible in real-world scenarios, and hence a few recent works attempt to restore pruned networks without any expensive retraining process. Their strong assumption is that every pruned neuron can be replaced with another, quite similar one, but unfortunately this does not hold in many neural networks, where the similarity between neurons is extremely low in some layers. In this article, we propose a more rigorous and robust method of restoring pruned networks in a fine-tuning-free and data-free manner, called LBYL (Leave Before You Leave). LBYL significantly relaxes the aforementioned assumption: each pruned neuron leaves pieces of its information to as many preserved neurons as possible, so that multiple neurons together obtain a more robust approximation of the original output of the neuron that just left. Our method is based on a theoretical analysis of how to formulate the reconstruction error between the original network and its approximation, which leads to a closed-form solution for our derived loss function.
Through extensive experiments, LBYL is confirmed to be more effective at approximating the original network and consequently to achieve higher accuracy for restored networks, compared to recent approaches exploiting the similarity between two neurons. The very first version of this work, which contains the major technical and theoretical components, was submitted to NeurIPS 2021 and ICML 2022.

Submitted 6 February, 2025; originally announced February 2025.

Comments: Under Review in TNNLS since May 2022
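To make the idea of a pruned neuron "leaving its information" to several preserved neurons concrete, the following numpy sketch uses a deliberately simplified linear setting. It expresses the pruned unit's incoming weight vector as a least-squares combination of the preserved units' incoming weights, then folds the pruned unit's outgoing weights into theirs in proportion to those coefficients. This ignores the activation function and is an illustrative assumption, not the paper's derived loss or closed-form solution; like LBYL, though, it uses only the weights and needs neither data nor fine-tuning.

```python
import numpy as np


def prune_and_compensate(w_in, w_out, pruned):
    """Remove hidden unit `pruned` from a linear two-layer map y = w_out @ (w_in @ x).

    w_in:  (hidden, in_dim)  incoming weights, one row per hidden unit
    w_out: (out_dim, hidden) outgoing weights, one column per hidden unit
    """
    keep = [k for k in range(w_in.shape[0]) if k != pruned]
    # Closed-form least squares: coefficients c with w_in[pruned] ~= c @ w_in[keep].
    c, *_ = np.linalg.lstsq(w_in[keep].T, w_in[pruned], rcond=None)
    # Each preserved unit k absorbs c[k] times the pruned unit's outgoing weights.
    new_out = w_out[:, keep] + np.outer(w_out[:, pruned], c)
    return w_in[keep], new_out


rng = np.random.default_rng(0)
w_in = rng.standard_normal((4, 6))
w_in[3] = 0.5 * w_in[0] - w_in[2]  # unit 3 is redundant by construction
w_out = rng.standard_normal((2, 4))
x = rng.standard_normal(6)

new_in, new_out = prune_and_compensate(w_in, w_out, pruned=3)
print(np.allclose(w_out @ w_in @ x, new_out @ new_in @ x))         # compensated
print(np.allclose(w_out @ w_in @ x, w_out[:, :3] @ w_in[:3] @ x))  # naive removal
```

Note that the pruned unit here is spread across two preserved units rather than replaced by a single most-similar one, which is the contrast the abstract draws with similarity-based restoration.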

arXiv:2502.03984 [pdf, other]

PGB: One-Shot Pruning for BERT via Weight Grouping and Permutation

Authors: Hyemin Lim, Jaeyeon Lee, Dong-Wan Choi

Abstract: Large pretrained language models such as BERT suffer from slow inference and high memory usage due to their huge size. Recent approaches to compressing BERT rely on iterative pruning and knowledge distillation, which, however, are often too complicated and computationally intensive. This paper proposes a novel semi-structured one-shot pruning method for BERT, called $\textit{Permutation and Grouping for BERT}$ (PGB), which achieves high compression efficiency and sparsity while preserving accuracy. To this end, PGB identifies important groups of individual weights by permutation and prunes all other weights as a structure in both multi-head attention and feed-forward layers. Furthermore, if no important group is formed in a particular layer, PGB drops the entire layer to produce an even more compact model. Our experimental results on BERT$_{\text{BASE}}$ demonstrate that PGB outperforms the state-of-the-art structured pruning methods in terms of computational cost and accuracy preservation.

Submitted 6 February, 2025; originally announced February 2025.

arXiv:2502.00196 [pdf, other]

DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets

Authors: Abdurrahim Yilmaz, Furkan Yuceyalcin, Ece Gokyayla, Donghee Choi, Ozan Erdem, Ali Anil Demircali, Rahmetullah Varol, Ufuk Gorkem Kirabali, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran

Abstract: A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large datasets of image-text pairs. We introduce DermaSynth, a dataset comprising 92,020 synthetic image-text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Using a state-of-the-art LLM, Gemini 2.0, with clinically related prompts and the self-instruct method, we generated diverse and rich synthetic texts. Dataset metadata were incorporated into the input prompts to reduce potential hallucinations. The resulting dataset builds upon open-access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We anticipate that this dataset will support and accelerate AI research in dermatology. Data and code underlying this work are accessible at https://github.com/abdurrahimyilmaz/DermaSynth.

Submitted 4 March, 2025; v1 submitted 31 January, 2025; originally announced February 2025.

Comments: 12 pages, 4 figures

arXiv:2501.16769 [pdf, ps, other]

Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models

Authors: Muhammad Atta ur Rahman, Dooseop Choi, Seung-Ik Lee, KyoungWook Min

Abstract: Open-vocabulary semantic segmentation attempts to classify and outline objects in an image using arbitrary text labels, including those unseen during training. When effectively trained, self-supervised models solve numerous visual and linguistic processing problems. This study investigates simple yet efficient methods for adapting previously learned foundation models to open-vocabulary semantic segmentation tasks. We propose "Beyond-Labels", a lightweight transformer-based fusion module that uses a small amount of image segmentation data to fuse frozen visual representations with language concepts. This strategy allows the model to leverage the extensive knowledge of pre-trained models without requiring significant retraining, making the approach data-efficient and scalable. Furthermore, we capture positional information in images using Fourier embeddings, improving generalization and enabling smooth and consistent spatial encoding. We perform thorough ablation studies to examine the main components of our proposed method. On the standard benchmark PASCAL-5i, the method performs better despite being trained on frozen vision and language representations. Index Terms: Beyond-Labels, open-vocabulary semantic segmentation, Fourier embeddings, PASCAL-5i

Submitted 1 July, 2025; v1 submitted 28 January, 2025; originally announced January 2025.

Comments: Accepted at the 17th IEEE International Conference on Advanced Computational Intelligence (ICACI 2025)
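Fourier positional embeddings come in several forms; a common one maps normalized 2D coordinates through sines and cosines at geometrically spaced frequencies. The sketch below builds such an embedding for an H×W patch grid. The number of frequencies, the frequency schedule, and the 14×14 grid size are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np


def fourier_grid_embedding(height, width, num_freqs=8):
    """Return an array of shape (height, width, 4 * num_freqs) of Fourier features.

    Coordinates are normalized to [-1, 1]; each of y and x is encoded with
    sin and cos at frequencies pi * 2**0, ..., pi * 2**(num_freqs - 1).
    """
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")  # each (H, W)
    coords = np.stack([grid_y, grid_x], axis=-1)          # (H, W, 2)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi         # (F,)
    angles = coords[..., None] * freqs                    # (H, W, 2, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (H, W, 2, 2F)
    return feats.reshape(height, width, 4 * num_freqs)


emb = fourier_grid_embedding(14, 14, num_freqs=8)
print(emb.shape)  # (14, 14, 32)
```

In a fusion module, features like these would typically be passed through a learned linear projection and added to the frozen visual tokens. Low frequencies vary slowly across the grid while high frequencies distinguish neighboring patches, which is one way such encodings can stay smooth yet spatially precise.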

arXiv:2501.16726 [pdf, other]

Bridging Neural Networks and Wireless Systems with MIMO-OFDM Semantic Communications

Authors: Hanju Yoo, Dongha Choi, Yonghwi Kim, Yoontae Kim, Songkuk Kim, Chan-Byoung Chae, Robert W. Heath Jr

Abstract: Semantic communications aim to enhance transmission efficiency by jointly optimizing source coding, channel coding, and modulation. While prior research has demonstrated promising performance in simulations, real-world implementations often face significant challenges, including noise variability and nonlinear distortions, leading to performance gaps. This article investigates these challenges in a multiple-input multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM)-based semantic communication system, focusing on the practical impacts of power amplifier (PA) nonlinearity and peak-to-average power ratio (PAPR) variations. Our analysis identifies frequency selectivity of the actual channel as a critical factor in performance degradation and demonstrates that targeted mitigation strategies can enable semantic systems to approach theoretical performance. By addressing key limitations in existing designs, we provide actionable insights for advancing semantic communications in practical wireless environments. This work establishes a foundation for bridging the gap between theoretical models and real-world deployment, highlighting essential considerations for system design and optimization.

Submitted 28 January, 2025; originally announced January 2025.

Comments: 7 pages, 5 figures
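Peak-to-average power ratio (PAPR) is one of the practical quantities the article examines. As a minimal illustration (not the authors' simulation setup), the sketch below computes the PAPR of one OFDM symbol obtained by an IFFT over the subcarriers. The 64-subcarrier count and QPSK mapping are assumptions, and no oversampling is applied, which can underestimate the continuous-time peak.

```python
import numpy as np


def papr_db(freq_symbols):
    """PAPR in dB of the time-domain OFDM symbol produced by an IFFT of the subcarrier symbols."""
    x = np.fft.ifft(freq_symbols)
    power = np.abs(x) ** 2
    return 10.0 * np.log10(power.max() / power.mean())


rng = np.random.default_rng(0)
n_sub = 64
# Random unit-energy QPSK symbols on all subcarriers.
qpsk = (rng.choice([-1.0, 1.0], n_sub) + 1j * rng.choice([-1.0, 1.0], n_sub)) / np.sqrt(2)
print(f"QPSK OFDM symbol: {papr_db(qpsk):.2f} dB")

# A single active subcarrier yields a constant-envelope signal, so its PAPR is 0 dB.
single = np.zeros(n_sub, dtype=complex)
single[5] = 1.0
print(f"single tone: {papr_db(single):.2f} dB")
```

Summing many independently modulated subcarriers occasionally produces large peaks (bounded by 10·log10(N) dB for N subcarriers), and such peaks push a power amplifier into its nonlinear region, which is why PA nonlinearity and PAPR are usually studied together.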

arXiv:2501.11055 [pdf, ps, other]

Singularities of the nested Hilbert scheme of points of length 3, 4

Authors: Doyoung Choi

Abstract: We show that the projection morphism $X^{[3,4]} \longrightarrow X^{[3]}$ is flat even though it has reducible fibers. After showing that the fiber of the residual morphism $\textrm{res}_{3,4} :X^{[3,4]} \longrightarrow X$ has rational singularities, we conclude that $X^{[3,4]}$ has canonical Gorenstein singularities. As a corollary, we specify the singularities of several nested Hilbert schemes.

Submitted 19 January, 2025; originally announced January 2025.

Comments: 15 pages

MSC Class: Primary:14C05; Secondary:14E18

arXiv:2501.05712 [pdf, other]

Multi-Step Reasoning in Korean and the Emergent Mirage

Authors: Guijin Son, Hyunwoo Ko, Dasol Choi

Abstract: We introduce HRMCR (HAE-RAE Multi-Step Commonsense Reasoning), a benchmark designed to evaluate large language models' ability to perform multi-step reasoning in culturally specific contexts, focusing on Korean. The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps. Consistent with prior observations on emergent abilities, our experiments reveal that models trained with fewer than $2 \cdot 10^{25}$ training FLOPs struggle to solve any questions, showing near-zero performance. Beyond this threshold, performance improves sharply. State-of-the-art models (e.g., O1) still score under 50%, underscoring the difficulty of our tasks. Notably, stepwise analysis suggests the observed emergent behavior may stem from compounding errors across multiple steps rather than reflecting a genuinely new capability. We publicly release the benchmark and commit to regularly updating the dataset to prevent contamination.

Submitted 12 March, 2025; v1 submitted 10 January, 2025; originally announced January 2025.

Comments: C3NLP @ NAACL 2025
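The compounding-error explanation can be illustrated with simple arithmetic. If each of k reasoning steps is solved independently with probability p, the whole question is solved with probability p^k, so a gradual rise in per-step accuracy can look like a sudden jump at the question level. Step independence and the six-step count below are simplifying assumptions for illustration, not figures from the benchmark.

```python
def end_to_end_accuracy(per_step: float, num_steps: int) -> float:
    """Probability of solving all steps when each succeeds independently with `per_step`."""
    return per_step ** num_steps


# Per-step accuracy rising gradually; end-to-end accuracy stays near zero, then climbs fast.
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    print(f"per-step {p:.2f} -> 6-step question {end_to_end_accuracy(p, 6):.3f}")
```

For example, raising per-step accuracy from 0.5 to 0.7 moves six-step accuracy only from about 1.6% to 11.8%, while the further rise to 0.9 brings it to about 53%, a curve that can resemble an emergent threshold even though each step improves smoothly.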

arXiv:2501.03441 [pdf, other]

Finding A Voice: Evaluating African American Dialect Generation for Chatbot Technology

Authors: Sarah E. Finch, Ellie S. Paek, Sejung Kwon, Ikseon Choi, Jessica Wells, Rasheeta Chandler, Jinho D. Choi

Abstract: As chatbots become increasingly integrated into everyday tasks, designing systems that accommodate diverse user populations is crucial for fostering trust, engagement, and inclusivity. This study investigates the ability of contemporary Large Language Models (LLMs) to generate African American Vernacular English (AAVE) and evaluates the impact of AAVE usage on user experiences in chatbot applications. We analyze the performance of three LLM families (Llama, GPT, and Claude) in producing AAVE-like utterances at varying dialect intensities and assess user preferences across multiple domains, including healthcare and education. Despite LLMs' proficiency in generating AAVE-like language, findings indicate that AAVE-speaking users prefer Standard American English (SAE) chatbots, with higher levels of AAVE correlating with lower ratings for a variety of characteristics, including chatbot trustworthiness and role appropriateness. These results highlight the complexities of creating inclusive AI systems and underscore the need for further exploration of diversity to enhance human-computer interactions.

Submitted 6 January, 2025; originally announced January 2025.

arXiv:2501.02448 [pdf, other]

Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap

Authors: Hyunwoo Ko, Guijin Son, Dasol Choi

Abstract: Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks. However, despite their strong reasoning capabilities in high-resource languages (e.g., English and Chinese), a significant performance gap persists in other languages. To investigate this gap in Korean, we introduce HRM8K, a benchmark comprising 8,011 English-Korean parallel bilingual math problems. Through systematic analysis of model behaviors, we identify a key finding: these performance disparities stem primarily from difficulties in comprehending non-English inputs, rather than limitations in reasoning capabilities. Based on these findings, we propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation. By fine-tuning the model on 130k synthetically generated data points, UST achieves a 10.91% improvement on the HRM8K benchmark and reduces the multilingual performance gap from 11.6% to 0.7%. Additionally, we show that improvements from UST generalize effectively to different Korean domains, demonstrating that capabilities acquired from machine-verifiable content can be generalized to other areas. We publicly release the benchmark, training dataset, and models.

Submitted 31 January, 2025; v1 submitted 5 January, 2025; originally announced January 2025.

Comments: 18 pages, 14 figures, 9 tables

arXiv:2501.00271 [pdf, ps, other]

Generalized finite and affine $W$-algebras in type $A$

Authors: Dong Jun Choi, Alexander Molev, Uhi Rinn Suh

Abstract: We construct a new family of affine $W$-algebras $W^k(λ,μ)$ parameterized by partitions $λ$ and $μ$ associated with the centralizers of nilpotent elements in $\mathfrak{gl}_N$. The new family unifies a few known classes of $W$-algebras. In particular, for the column-partition $λ$ we recover the affine $W$-algebras $W^k(\mathfrak{gl}_N,f)$ of Kac, Roan and Wakimoto, associated with nilpotent elements $f\in\mathfrak{gl}_N$ of type $μ$. Our construction is based on a version of the BRST complex of the quantum Drinfeld-Sokolov reduction. We show that the application of the Zhu functor to the vertex algebras $W^k(λ,μ)$ yields a family of generalized finite $W$-algebras $U(λ,μ)$ which we also describe independently as associative algebras.

Submitted 30 December, 2024; originally announced January 2025.

Comments: 29 pages

Showing 1–50 of 418 results for author: Choi, D