Search | arXiv e-print repository

Instance-Specific Test-Time Training for Speech Editing in the Wild

Authors: Taewoo Kim, Uijong Lee, Hayoung Park, Choongsang Cho, Nam In Park, Young Han Lee

Abstract: Speech editing systems aim to naturally modify speech content while preserving acoustic consistency and speaker identity. However, previous studies often struggle to adapt to unseen and diverse acoustic conditions, resulting in degraded editing performance in real-world scenarios. To address this, we propose an instance-specific test-time training method for speech editing in the wild. Our approach employs direct supervision from ground-truth acoustic features in unedited regions, and indirect supervision in edited regions via auxiliary losses based on duration constraints and phoneme prediction. This strategy mitigates the bandwidth discontinuity problem in speech editing, ensuring smooth acoustic transitions between unedited and edited regions. Additionally, it enables precise control over speech rate by adapting the model to target durations via mask length adjustment during test-time training. Experiments on in-the-wild benchmark datasets demonstrate that our method outperforms existing speech editing systems in both objective and subjective evaluations.

Submitted 16 June, 2025; originally announced June 2025.

Comments: Submitted to IEEE Signal Processing Letters

arXiv:2506.12482 [pdf, ps, other]

Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare

Authors: Yubin Kim, Hyewon Jeong, Chanwoo Park, Eugene Park, Haipeng Zhang, Xin Liu, Hyeonhoon Lee, Daniel McDuff, Marzyeh Ghassemi, Cynthia Breazeal, Samir Tulebaev, Hae Won Park

Abstract: Current large language models (LLMs), despite their power, can introduce safety risks in clinical settings due to limitations such as poor error detection and single point of failure. To address this, we propose Tiered Agentic Oversight (TAO), a hierarchical multi-agent framework that enhances AI safety through layered, automated supervision. Inspired by clinical hierarchies (e.g., nurse, physician, specialist), TAO conducts agent routing based on task complexity and agent roles. Leveraging automated inter- and intra-tier collaboration and role-playing, TAO creates a robust safety framework. Ablation studies reveal that TAO's superior performance is driven by its adaptive tiered architecture, which improves safety by over 3.2% compared to static single-tier configurations; the critical role of its lower tiers, particularly tier 1, whose removal most significantly impacts safety; and the strategic assignment of more advanced LLM to these initial tiers, which boosts performance by over 2% compared to less optimal allocations while achieving near-peak safety efficiently. These mechanisms enable TAO to outperform single-agent and multi-agent frameworks in 4 out of 5 healthcare safety benchmarks, showing up to an 8.2% improvement over the next-best methods in these evaluations. Finally, we validate TAO via an auxiliary clinician-in-the-loop study where integrating expert feedback improved TAO's accuracy in medical triage from 40% to 60%.

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.12471 [pdf, ps, other]

Adaptive Multi-resolution Hash-Encoding Framework for INR-based Dental CBCT Reconstruction with Truncated FOV

Authors: Hyoung Suk Park, Kiwan Jeon

Abstract: Implicit neural representation (INR), particularly in combination with hash encoding, has recently emerged as a promising approach for computed tomography (CT) image reconstruction. However, directly applying INR techniques to 3D dental cone-beam CT (CBCT) with a truncated field of view (FOV) is challenging. During the training process, if the FOV does not fully encompass the patient's head, a discrepancy arises between the measured projections and the forward projections computed within the truncated domain. This mismatch leads the network to estimate attenuation values inaccurately, producing severe artifacts in the reconstructed images. In this study, we propose a computationally efficient INR-based reconstruction framework that leverages multi-resolution hash encoding for 3D dental CBCT with a truncated FOV. To mitigate truncation artifacts, we train the network over an expanded reconstruction domain that fully encompasses the patient's head. For computational efficiency, we adopt an adaptive training strategy that uses a multi-resolution grid: finer resolution levels and denser sampling inside the truncated FOV, and coarser resolution levels with sparser sampling outside. To maintain consistent input dimensionality of the network across spatially varying resolutions, we introduce an adaptive hash encoder that selectively activates the lower-level features of the hash hierarchy for points outside the truncated FOV. The proposed method with an extended FOV effectively mitigates truncation artifacts. Compared with a naive domain extension using fixed resolution levels and a fixed sampling rate, the adaptive strategy reduces computational time by over 60% for an image volume of 800x800x600, while preserving the PSNR within the truncated FOV.

Submitted 14 June, 2025; originally announced June 2025.

Comments: 18 pages, 4 figures

MSC Class: 68Wxx

arXiv:2506.11920 [pdf, ps, other]

Nanoscale Magnetic Resonance Imaging and Control of a Strongly Interacting Dipolar System

Authors: Piotr Put, Nathaniel T. Leitao, Christina Spaegele, Haoyang Gao, Oksana Makarova, Bartholomeus Machielse, Hengyun Zhou, Federico Capasso, Leigh S. Martin, Hongkun Park, Mikhail D. Lukin

Abstract: Magnetic Resonance Imaging (MRI) is a fundamental tool for physical and life sciences, yet its spatial resolution is typically limited to macroscopic scales. Here, we demonstrate nanoscale MRI by combining strong, time-dependent local magnetic field gradients with coherent control of a dense ensemble of electron spins hosted in atom-like defects in diamond. Using this platform, we generate and manipulate nanoscale spin textures - spatially structured patterns of spin orientation - and track their evolution under engineered many-body interactions. Controlling the dipolar spin exchange driving the dynamics, we observe striking signatures of sensitivity to the microscopic details underlying the polarization distribution. Our results open the door for robust control of metrologically useful entanglement, and nanoscale imaging of materials and biological systems under ambient conditions.

Submitted 13 June, 2025; originally announced June 2025.

arXiv:2506.11329 [pdf, ps, other]

A4: Microarchitecture-Aware LLC Management for Datacenter Servers with Emerging I/O Devices

Authors: Haneul Park, Jiaqi Lou, Sangjin Lee, Yifan Yuan, Kyoung Soo Park, Yongseok Son, Ipoom Jeong, Nam Sung Kim

Abstract: In modern server CPUs, the Last-Level Cache (LLC) serves not only as a victim cache for higher-level private caches but also as a buffer for low-latency DMA transfers between CPU cores and I/O devices through Direct Cache Access (DCA). However, prior work has shown that high-bandwidth network-I/O devices can rapidly flood the LLC with packets, often causing significant contention with co-running workloads. One step further, this work explores hidden microarchitectural properties of the Intel Xeon CPUs, uncovering two previously unrecognized LLC contentions triggered by emerging high-bandwidth I/O devices. Specifically, (C1) DMA-written cache lines in LLC ways designated for DCA (referred to as DCA ways) are migrated to certain LLC ways (denoted as inclusive ways) when accessed by CPU cores, unexpectedly contending with non-I/O cache lines within the inclusive ways. In addition, (C2) high-bandwidth storage-I/O devices, which are increasingly common in datacenter servers, benefit little from DCA while contending with (latency-sensitive) network-I/O devices within DCA ways. To this end, we present A4, a runtime LLC management framework designed to alleviate both (C1) and (C2) among diverse co-running workloads, using a hidden knob and other hardware features implemented in those CPUs. Additionally, we demonstrate that A4 can also alleviate other previously known network-I/O-driven LLC contentions. Overall, it improves the performance of latency-sensitive, high-priority workloads by 51% without notably compromising that of low-priority workloads.

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.11081 [pdf, ps, other]

SAGE: Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs

Authors: Aditi, Hyunwoo Park, Sicheol Sung, Yo-Sub Han, Sang-Ki Ko

Abstract: Grammar-based test case generation has proven effective for competitive programming problems, but generating valid and general grammars from natural language specifications remains a key challenge, especially under limited supervision. Context-Free Grammars with Counters (CCFGs) have recently been introduced as a formalism to represent such specifications with logical constraints by storing and reusing counter values during derivation. In this work, we explore the use of open-source large language models (LLMs) to induce CCFGs from specifications using a small number of labeled examples and verifiable reward-guided reinforcement learning. Our approach first fine-tunes an open-source LLM to perform specification-to-grammar translation, and further applies Group Relative Policy Optimization (GRPO) to enhance grammar validity and generality. We also examine the effectiveness of iterative feedback for open and closed-source LLMs in correcting syntactic and semantic errors in generated grammars. Experimental results show that our approach SAGE achieves stronger generalization and outperforms 17 open and closed-source LLMs in both grammar quality and test effectiveness, improving over the state-of-the-art by 15.92%p in grammar validity and 12.34%p in test effectiveness. We provide our implementation and dataset at the following anonymous repository: https://anonymous.4open.science/r/SAGE-5714

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2506.10567 [pdf, ps, other]

LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System

Authors: Hongbeen Park, Minjeong Park, Giljoo Nam, Jinkyu Kim

Abstract: Simultaneous Localization and Mapping (SLAM) has been crucial across various domains, including autonomous driving, mobile robotics, and mixed reality. Dense visual SLAM, leveraging RGB-D camera systems, offers advantages but faces challenges in achieving real-time performance, robustness, and scalability for large-scale scenes. Recent approaches utilizing neural implicit scene representations show promise but suffer from high computational costs and memory requirements. ESLAM introduced a plane-based tensor decomposition but still struggled with memory growth. Addressing these challenges, we propose a more efficient visual SLAM model, called LRSLAM, utilizing low-rank tensor decomposition methods. Our approach, leveraging the Six-axis and CP decompositions, achieves better convergence rates, memory efficiency, and reconstruction/localization quality than existing state-of-the-art approaches. Evaluation across diverse indoor RGB-D datasets demonstrates LRSLAM's superior performance in terms of parameter efficiency, processing time, and accuracy, retaining reconstruction and localization quality. Our code will be publicly available upon publication.

Submitted 12 June, 2025; originally announced June 2025.

Comments: Accepted at ECCV 2024

arXiv:2506.09993 [pdf, other]

Text-Aware Image Restoration with Diffusion Models

Authors: Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim

Abstract: Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/

Submitted 11 June, 2025; originally announced June 2025.

Comments: Project page: https://cvlab-kaist.github.io/TAIR/

arXiv:2506.08660 [pdf, other]

Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness

Authors: Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim

Abstract: Real-world time series data are inherently multivariate, often exhibiting complex inter-channel dependencies. Each channel is typically sampled at its own period and is prone to missing values due to various practical and operational constraints. These characteristics pose fundamental challenges related to channel dependency, sampling asynchrony, and missingness, all of which must be addressed to enable robust and reliable forecasting in practical settings. However, most existing architectures are built on oversimplified assumptions, such as identical sampling periods across channels and fully observed inputs at test time, which often do not hold in real-world scenarios. To bridge this gap, we propose ChannelTokenFormer, a Transformer-based forecasting model with a flexible architecture designed to explicitly capture cross-channel interactions, accommodate channel-wise asynchronous sampling, and effectively handle missing values. Extensive experiments on three benchmark datasets modified to reflect practical settings, along with one real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world conditions.

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08573 [pdf, ps, other]

Designing funding rates for perpetual futures in cryptocurrency markets

Authors: Jaehyun Kim, Hyungbin Park

Abstract: In cryptocurrency markets, a key challenge for perpetual future issuers is maintaining alignment between the perpetual future price and target value. This study addresses this challenge by exploring the relationship between funding rates and perpetual future prices. Our results demonstrate that by appropriately designing funding rates, the perpetual future price can remain aligned with the target value. We develop replicating portfolios for perpetual futures, offering issuers an effective method to hedge their positions. Additionally, we provide path-dependent funding rates as a practical alternative and investigate the difference between the original and path-dependent funding rates. To achieve these results, our study employs path-dependent infinite-horizon BSDEs in conjunction with arbitrage pricing theory. Our main results are obtained by establishing the existence and uniqueness of solutions to these BSDEs and analyzing the large-time behavior of these solutions.

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08476 [pdf, ps, other]

Bridging Electrostatic Screening and Ion Transport in Lithium Salt-Doped Ionic Liquids

Authors: Hyungshick Park, Bong June Sung, Jeongmin Kim

Abstract: Alkali salt-doped ionic liquids are emerging as promising electrolyte systems for energy applications, owing to their excellent interfacial stability. To address their limited ionic conductivity, various strategies have been proposed, including modifying the ion solvation environment and enhancing the transport of selected ions (e.g., Li$^+$). Despite the pivotal role of electrostatic interactions in determining key physicochemical properties, their influence on ion transport in such systems has received relatively little attention. In this work, we investigate the connection between ion transport and electrostatic screening using atomistic molecular dynamics simulations of 1-butyl-1-methylpyrrolidinium bis(trifluoromethanesulfonyl)imide ([pyr$_{14}$][TFSI]) doped with lithium bis(trifluoromethanesulfonyl)imide (LiTFSI) at molar fractions x$_{LiTFSI}$ $\le$ 0.3. We find that the charge-charge and density-density correlation functions exhibit oscillatory exponential decay, indicating that LiTFSI doped [pyr$_{14}$][TFSI] is a charge- and mass-dense system. The electrostatic screening length decreases with increasing LiTFSI concentration, whereas the decay length of the density-density correlation functions remains nearly unchanged. Notably, we find that the x$_{LiTFSI}$-sensitive screening length serves as a central length scale for disentangling species-specific contributions of ion pairs to collective ion transport upon LiTFSI doping. This framework provides a unifying perspective on the interplay between structure and transport in ionic liquid systems.

Submitted 10 June, 2025; originally announced June 2025.

Comments: 11 pages, 5 figures

arXiv:2506.07879 [pdf, ps, other]

Measurement of the CP asymmetry in $D^+ \to π^+ π^0$ decays at Belle II

Authors: Belle II Collaboration, I. Adachi, L. Aggarwal, H. Ahmed, H. Aihara, N. Akopov, S. Alghamdi, M. Alhakami, A. Aloisio, K. Amos, M. Angelsmark, N. Anh Ky, C. Antonioli, D. M. Asner, H. Atmacan, V. Aushev, M. Aversano, R. Ayad, V. Babu, H. Bae, N. K. Baghel, P. Bambade, Sw. Banerjee, S. Bansal, M. Barrett, et al. (380 additional authors not shown)

Abstract: We measure the CP asymmetry in $D^+ \to π^+ π^0$ decays reconstructed in $e^+ e^-$ collisions at the Belle II experiment using a data set corresponding to an integrated luminosity of 428 fb$^{-1}$. A control sample of $D^+ \to π^+ K_{S}$ decays is used to correct for detection and production asymmetries. The result, $A_{CP}(D^+ \to π^+π^0) =(-1.8 \pm 0.9 \pm 0.1)\%$, where the first uncertainty is statistical and the second systematic, is the most precise determination to date. It agrees with the prediction of CP symmetry from the standard model, and with results of previous measurements.

Submitted 9 June, 2025; originally announced June 2025.

Report number: Belle II Preprint 2025-012, KEK Preprint 2025-10
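
Editorial note on the quoted result: the short sketch below shows how far the central value sits from zero (exact CP symmetry). It assumes the statistical and systematic uncertainties are uncorrelated and can be added in quadrature; this is a common convention, not a procedure stated in the abstract. The function name `combined_uncertainty` is hypothetical.

```python
import math


def combined_uncertainty(stat: float, syst: float) -> float:
    """Add two uncertainties in quadrature (assumes they are uncorrelated)."""
    return math.hypot(stat, syst)


# Values quoted in the abstract, in percent.
a_cp = -1.8
stat = 0.9
syst = 0.1

total = combined_uncertainty(stat, syst)
# Distance of the central value from zero, in units of the total uncertainty.
deviation_in_sigma = a_cp / total

print(f"total uncertainty: {total:.3f} %")
print(f"deviation from zero: {deviation_in_sigma:.2f} sigma")
```

Under that assumption the central value lies roughly two standard deviations from zero, which fits the abstract's statement that the result agrees with CP symmetry.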

arXiv:2506.04054 [pdf, ps, other]

Video Deblurring with Deconvolution and Aggregation Networks

Authors: Giyong Choi, HyunWook Park

Abstract: In contrast to single-image deblurring, video deblurring has the advantage that neighbor frames can be utilized to deblur a target frame. However, existing video deblurring algorithms often fail to properly employ the neighbor frames, resulting in sub-optimal performance. In this paper, we propose a deconvolution and aggregation network (DAN) for video deblurring that utilizes the information of neighbor frames well. In DAN, both deconvolution and aggregation strategies are achieved through three sub-networks: the preprocessing network (PPN) and the alignment-based deconvolution network (ABDN) for the deconvolution scheme; the frame aggregation network (FAN) for the aggregation scheme. In the deconvolution part, blurry inputs are first preprocessed by the PPN with non-local operations. Then, the output frames from the PPN are deblurred by the ABDN based on the frame alignment. In the FAN, these deblurred frames from the deconvolution part are combined into a latent frame according to reliability maps which infer pixel-wise sharpness. The proper combination of three sub-networks can achieve favorable performance on video deblurring by using the neighbor frames suitably. In experiments, the proposed DAN was demonstrated to be superior to existing state-of-the-art methods through both quantitative and qualitative evaluations on the public datasets.

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2506.03892 [pdf, ps, other]

Joint Video Enhancement with Deblurring, Super-Resolution, and Frame Interpolation Network

Authors: Giyong Choi, HyunWook Park

Abstract: Video quality is often severely degraded by multiple factors rather than a single factor. These low-quality videos can be restored to high-quality videos by sequentially performing appropriate video enhancement techniques. However, the sequential approach was inefficient and sub-optimal because most video enhancement approaches were designed without taking into account that multiple factors together degrade video quality. In this paper, we propose a new joint video enhancement method that mitigates multiple degradation factors simultaneously by resolving an integrated enhancement problem. Our proposed network, named DSFN, directly produces a high-resolution, high-frame-rate, and clear video from a low-resolution, low-frame-rate, and blurry video. In the DSFN, low-resolution and blurry input frames are enhanced by a joint deblurring and super-resolution (JDSR) module. Meanwhile, intermediate frames between input adjacent frames are interpolated by a triple-frame-based frame interpolation (TFBFI) module. The proper combination of the proposed modules of DSFN can achieve superior performance on the joint video enhancement task. Experimental results show that the proposed method outperforms other sequential state-of-the-art techniques on public datasets with a smaller network size and faster processing time.

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2506.02338 [pdf, other]

One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL

Authors: Hyungjoo Chae, Dongjin Kang, Jihyuk Kim, Beong-woo Kwak, Sunghyun Park, Haeju Park, Jinyoung Yeo, Moontae Lee, Kyungjae Lee

Abstract: With the release of R1, a publicly available large reasoning model (LRM), researchers commonly train new LRMs by training language models on R1's long chain-of-thought (CoT) inferences. While prior works show that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on the existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to--or slightly below--R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning--models initialized on our data achieve 2-3x larger gains with RLVR.

Submitted 2 June, 2025; originally announced June 2025.

Comments: ACL 2025 Industry

arXiv:2506.01620 [pdf, ps, other]

Exploring the potential for kinematically colder HI component as a tracer for star-forming gas in nearby galaxies

Authors: Hye-Jin Park, Andrew J. Battisti, Antoine Marchal, Luca Cortese, Emily Wisnioski, Mark Seibert, Shin-Jeong Kim, Naomi McClure-Griffiths, W. J. G. de Blok, Kathryn Grasha, Barry F. Madore, Jeff A. Rich, Rachael L. Beaton

Abstract: Atomic hydrogen (HI) dominates the mass of the cold interstellar medium, undergoing thermal condensation to form molecular gas and fuel star formation. Kinematically colder HI components, identified via kinematic decomposition of HI 21 cm data cubes, serve as a crucial transition phase between diffuse warm neutral gas and molecular hydrogen (H$_{2}$). We analyse these colder HI components by decomposing HI 21 cm data cubes of seven nearby galaxies - Sextans A, NGC 6822, WLM, NGC 5068, NGC 7793, NGC 1566, and NGC 5236 - spanning metallicities (0.1 < $Z/Z_{\odot}$ < 1.0) and physical scales (53-1134 pc). Using a velocity dispersion threshold of 6 km s$^{-1}$, we classify the kinematically distinct components into narrow (colder) and broad (warmer). Cross-correlation analysis between the narrow HI components and H$_{2}$ or star formation rate (SFR) surface density at different spatial scales reveals that dwarf galaxies exhibit the strongest correlation at ~500-700 pc. The radially binned narrow HI fraction, $f_{\rm n} = I_{\rm narrowHI}/I_{\rm totalHI}$, in dwarf galaxies shows no clear trend with metallicity or SFR, while in spirals, $f_{\rm n}$ is lower in inner regions with higher metallicity and SFR. We find that the dataset resolution significantly impacts the results, with higher physical resolution data yielding a higher median $f_{\rm n}$, $\langle f_{\rm n} \rangle$, per galaxy. With this considered, dwarf galaxies consistently exhibit a larger $f_{\rm n}$ than spiral galaxies. These findings highlight the critical role of cold HI in regulating star formation across different galactic environments and emphasise the need for high-resolution HI observations to further unravel the connection between atomic-to-molecular gas conversion and galaxy evolution.

Submitted 2 June, 2025; originally announced June 2025.

Comments: 18 pages, 9 figures, accepted for publication to MNRAS

arXiv:2506.01411 [pdf, ps, other]

ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition

Authors: Minjeong Park, Hongbeen Park, Jinkyu Kim

Abstract: The Pedestrian Attribute Recognition (PAR) task aims to identify various detailed attributes of an individual, such as clothing, accessories, and gender. To enhance PAR performance, a model must capture features ranging from coarse-grained global attributes (e.g., for identifying gender) to fine-grained local details (e.g., for recognizing accessories) that may appear in diverse regions. Recent research suggests that body part representation can enhance the model's robustness and accuracy, but these methods are often restricted to attribute classes within fixed horizontal regions, leading to degraded performance when attributes appear in varying or unexpected body locations. In this paper, we propose Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition, dubbed as ViTA-PAR, to enhance attribute recognition through specialized multimodal prompting and vision-language alignment. We introduce visual attribute prompts that capture global-to-local semantics, enabling diverse attribute representations. To enrich textual embeddings, we design a learnable prompt template, termed person and attribute context prompting, to learn person and attributes context. Finally, we align visual and textual attribute features for effective fusion. ViTA-PAR is validated on four PAR benchmarks, achieving competitive performance with efficient inference. We release our code and model at https://github.com/mlnjeongpark/ViTA-PAR.

Submitted 2 June, 2025; originally announced June 2025.

Comments: Accepted to IEEE ICIP 2025

arXiv:2506.00827 [pdf, ps, other]

Improving Keystep Recognition in Ego-Video via Dexterous Focus

Authors: Zachary Chavis, Stephen J. Guy, Hyun Soo Park

Abstract: In this paper, we address the challenge of understanding human activities from an egocentric perspective. Traditional activity recognition techniques face unique challenges in egocentric videos due to the highly dynamic nature of the head during many activities. We propose a framework that seeks to address these challenges in a way that is independent of network architecture by restricting the ego-video input to a stabilized, hand-focused video. We demonstrate that this straightforward video transformation alone outperforms existing egocentric video baselines on the Ego-Exo4D Fine-Grained Keystep Recognition benchmark without requiring any alteration of the underlying model infrastructure.

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2505.24751 [pdf, ps, other]

EL-AGHF: Extended Lagrangian Affine Geometric Heat Flow

Authors: Sangmin Kim, Hae-Won Park

Abstract: We propose a constrained Affine Geometric Heat Flow (AGHF) method that evolves so as to suppress the dynamics gaps associated with inadmissible control directions. AGHF provides a unified framework applicable to a wide range of motion planning problems, including both holonomic and non-holonomic systems. However, to generate admissible trajectories, it requires assigning infinite penalties to inadmissible control directions. This design choice, while theoretically valid, often leads to high computational cost or numerical instability when the penalty becomes excessively large. To overcome this limitation, we extend AGHF using an Augmented Lagrangian approach by introducing a dual trajectory related to dynamics gaps in inadmissible control directions. This method solves the constrained variational problem as an extended parabolic partial differential equation defined over both the state and dual trajectories, ensuring the admissibility of the resulting trajectory. We demonstrate the effectiveness of our algorithm through simulation examples.

Submitted 30 May, 2025; originally announced May 2025.

Comments: 6 pages, 4 figures

arXiv:2505.23026 [pdf, ps, other]

Context-Robust Knowledge Editing for Language Models

Authors: Haewon Park, Gyubin Choi, Minjun Kim, Yohan Jo

Abstract: Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED, a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that KE methods often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.

Submitted 31 May, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

Comments: ACL 2025 Findings. Our code and datasets are available at https://github.com/holi-lab/CoRE

arXiv:2505.23006 [pdf, ps, other]

A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

Authors: Chiwan Park, Wonjun Jang, Daeryong Kim, Aelim Ahn, Kichang Yang, Woosung Hwang, Jihyeon Roh, Hyerin Park, Hyosun Wang, Min Seok Kim, Jihoon Kang

Abstract: The advancement of Large Language Models (LLMs) has led to significant improvements in various service domains, including search, recommendation, and chatbot applications. However, applying state-of-the-art (SOTA) research to industrial settings presents challenges, as it requires maintaining flexible conversational abilities while also strictly complying with service-specific constraints. This can be seen as two conflicting requirements due to the probabilistic nature of LLMs. In this paper, we propose our approach to addressing this challenge and detail the strategies we employed to overcome their inherent limitations in real-world applications. We conduct a practical case study of a conversational agent designed for the e-commerce domain, detailing our implementation workflow and optimizations. Our findings provide insights into bridging the gap between academic research and real-world application, introducing a framework for developing scalable, controllable, and reliable AI-driven agents.

Submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted to ACL 2025 Industry Track. 12 pages, 5 figures

ACM Class: I.2.7

arXiv:2505.21757 [pdf, ps, other]

BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

Authors: Yubin Kim, Zhiyuan Hu, Hyewon Jeong, Eugene Park, Shuyue Stella Li, Chanwoo Park, Shiyun Xiong, MingYu Lu, Hyeonhoon Lee, Xin Liu, Daniel McDuff, Cynthia Breazeal, Samir Tulebaev, Hae Won Park

Abstract: Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs' inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed that BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a superior balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) versus standard fine-tuning or explicitly instructed agents.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.21451 [pdf, ps, other]

Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication

Authors: Jocelyn Shen, Akhila Yerukola, Xuhui Zhou, Cynthia Breazeal, Maarten Sap, Hae Won Park

Abstract: Conversational breakdowns in close relationships are deeply shaped by personal histories and emotional context, yet most NLP research treats conflict detection as a general task, overlooking the relational dynamics that influence how messages are perceived. In this work, we leverage nonviolent communication (NVC) theory to evaluate LLMs in detecting conversational breakdowns and assessing how relationship backstory influences both human and model perception of conflicts. Given the sensitivity and scarcity of real-world datasets featuring conflict between familiar social partners with rich personal backstories, we contribute the PersonaConflicts Corpus, a dataset of N=5,772 naturalistic simulated dialogues spanning diverse conflict scenarios between friends, family members, and romantic partners. Through a controlled human study, we annotate a subset of dialogues and obtain fine-grained labels of communication breakdown types on individual turns, and assess the impact of backstory on human and model perception of conflict in conversation. We find that the polarity of relationship backstories significantly shifted human perception of communication breakdowns and impressions of the social partners, yet models struggle to meaningfully leverage those backstories in the detection task. Additionally, we find that models consistently overestimate how positively a message will make a listener feel. Our findings underscore the critical role of personalization to relationship contexts in enabling LLMs to serve as effective mediators in human communication for authentic connection.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.21380 [pdf, ps, other]

PHISH in MESH: Korean Adversarial Phonetic Substitution and Phonetic-Semantic Feature Integration Defense

Authors: Byungjun Kim, Minju Kim, Hyeonchu Park, Bugeun Kim

Abstract: As malicious users increasingly employ phonetic substitution to evade hate speech detection, researchers have investigated such strategies. However, two key challenges remain. First, existing studies have overlooked the Korean language, despite its vulnerability to phonetic perturbations due to its phonographic nature. Second, prior work has primarily focused on constructing datasets rather than developing architectural defenses. To address these challenges, we propose (1) PHonetic-Informed Substitution for Hangul (PHISH) that exploits the phonological characteristics of the Korean writing system, and (2) Mixed Encoding of Semantic-pHonetic features (MESH) that enhances the detector's robustness by incorporating phonetic information at the architectural level. Our experimental results demonstrate the effectiveness of our proposed methods on both perturbed and unperturbed datasets, suggesting that they not only improve detection performance but also reflect realistic adversarial behaviors employed by malicious users.

Submitted 27 May, 2025; originally announced May 2025.

Comments: Under review

arXiv:2505.21127 [pdf, ps, other]

The TYPHOON Stellar Population Synthesis Survey. II. Pushing Full Spectral Fitting to the Limit in the Nearby Grand Design Barred Spiral M83

Authors: Eva Sextl, Rolf-Peter Kudritzki, Fabio Bresolin, Kathryn Grasha, Hye-Jin Park, Qian-Hui Chen, Andrew J. Battisti, Mark Seibert, Barry F. Madore, Jeffrey A. Rich

Abstract: We apply population synthesis techniques to analyze TYPHOON long slit spectra of the starburst barred spiral galaxy M83. The analysis covers a central square of 5 arcmin side length. We determine the spatial distribution of dust through the analysis of reddening and extinction, together with star formation rates, ages, and metallicities of young and old stellar populations. For the first time, a spatial one-to-one comparison of metallicities derived from full-spectral fitting techniques with those obtained from individual young stellar probes has been carried out. The comparison with blue supergiant stars, young massive star clusters, and super star clusters shows a high degree of concordance when wavelength coverage in the $B$-band is available. The metallicity of the young population is supersolar and does not show a radial metallicity gradient along the investigated part of the disk, in agreement with our chemical evolution model. However, a notable decrease in metallicity is observed in a tightly confined region at the galaxy center, coinciding with circumnuclear orbits. We attribute this to matter infall either from the circumgalactic medium or a dwarf galaxy interloper or, alternatively, to AGN-interrupted chemical evolution. We confirm the presence of a dust cavity with a diameter of 260 pc close to the galaxy center. Dust absorption and molecular CO emission are spatially well correlated. We find an anticorrelation between R$_V$, the ratio of dust attenuation to reddening, and the emission strength of molecular species present in photo-dissociation regions. We confirm our results by using alternative fitting algorithms and stellar libraries.

Submitted 27 May, 2025; originally announced May 2025.

Comments: 25 pages, 21 figures

arXiv:2505.20609 [pdf, other]

Comparisons between a Large Language Model-based Real-Time Compound Diagnostic Medical AI Interface and Physicians for Common Internal Medicine Cases using Simulated Patients

Authors: Hyungjun Park, Chang-Yun Woo, Seungjo Lim, Seunghwan Lim, Keunho Kwak, Ju Young Jeong, Chong Hyun Suh

Abstract: Objective: To develop an LLM based realtime compound diagnostic medical AI interface and perform a clinical trial comparing this interface and physicians for common internal medicine cases based on the United States Medical License Exam (USMLE) Step 2 Clinical Skill (CS) style exams. Methods: A nonrandomized clinical trial was conducted on August 20, 2024. We recruited one general physician, two internal medicine residents (2nd and 3rd year), and five simulated patients. The clinical vignettes were adapted from the USMLE Step 2 CS style exams. We developed 10 representative internal medicine cases based on actual patients and included information available on initial diagnostic evaluation. Primary outcome was the accuracy of the first differential diagnosis. Repeatability was evaluated based on the proportion of agreement. Results: The accuracy of the physicians' first differential diagnosis ranged from 50% to 70%, whereas the realtime compound diagnostic medical AI interface achieved an accuracy of 80%. The proportion of agreement for the first differential diagnosis was 0.7. The accuracy of the first and second differential diagnoses ranged from 70% to 90% for physicians, whereas the AI interface achieved an accuracy rate of 100%. The average time for the AI interface (557 sec) was 44.6% shorter than that of the physicians (1006 sec). The AI interface ($0.08) also reduced costs by 98.1% compared to the physicians' average ($4.2). Patient satisfaction scores ranged from 4.2 to 4.3 for care by physicians and were 3.9 for the AI interface. Conclusion: An LLM based realtime compound diagnostic medical AI interface demonstrated diagnostic accuracy and patient satisfaction comparable to those of a physician, while requiring less time and lower costs. These findings suggest that AI interfaces may have the potential to assist primary care consultations for common internal medicine cases.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.19519 [pdf, ps, other]

Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift

Authors: Gihoon Kim, Hyungjin Park, Taesup Kim

Abstract: Personalization using text-to-image diffusion models involves adapting a pretrained model to novel subjects with only a few image examples. This task presents a fundamental challenge, as the model must not only learn the new subject effectively but also preserve its ability to generate diverse and coherent outputs across a wide range of prompts. In other words, successful personalization requires integrating new concepts without forgetting previously learned generative capabilities. Forgetting denotes unintended distributional drift, where the model's output distribution deviates from that of the original pretrained model. In this paper, we provide an analysis of this issue and identify a mismatch between standard training objectives and the goals of personalization. To address this, we propose a new training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution. Our method provides improved control over distributional drift and performs well even in data-scarce scenarios. Experimental results demonstrate that our approach consistently outperforms existing personalization methods, achieving higher CLIP-T, CLIP-I, and DINO scores.

Submitted 27 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.19401 [pdf, ps, other]

Stack Less, Repeat More: A Block Reusing Approach for Progressive Speech Enhancement

Authors: Jangyeon Kim, Ui-Hyeop Shin, Jaehyun Ko, Hyung-Min Park

Abstract: This paper presents an efficient speech enhancement (SE) approach that reuses a processing block repeatedly instead of conventional stacking. Rather than increasing the number of blocks for learning deep latent representations, repeating a single block leads to progressive refinement while reducing parameter redundancy. We also minimize domain transformation by keeping an encoder and decoder shallow and reusing a single sequence modeling block. Experimental results show that the number of processing stages is more critical to performance than the number of blocks with different weights. Also, we observed that the proposed method gradually refines a noisy input within a single block. Furthermore, with the block reuse method, we demonstrate that deepening the encoder and decoder can be redundant for learning deep complex representation. Therefore, the experimental results confirm that the proposed block reusing enables progressive learning and provides an efficient alternative for SE.

Submitted 25 May, 2025; originally announced May 2025.

Comments: Accepted to Interspeech 2025

arXiv:2505.16351 [pdf, other]

Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection

Authors: Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Hwi Joo Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli

Abstract: Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.

Submitted 24 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: Accepted for Interspeech 2025

arXiv:2505.15922 [pdf, ps, other]

Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

Authors: Dong Won Lee, Hae Won Park, Cynthia Breazeal, Louis-Philippe Morency

Abstract: We propose a large language model based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.

Submitted 21 May, 2025; originally announced May 2025.

Comments: 9 pages, 3 figures, 3 tables

arXiv:2505.14814 [pdf, ps, other]

GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples

Authors: Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang

Abstract: Spoken Keyword Spotting (KWS) is the task of distinguishing between the presence and absence of a keyword in audio. The accuracy of a KWS model hinges on its ability to correctly classify examples close to the keyword and non-keyword boundary. These boundary examples are often scarce in training data, limiting model performance. In this paper, we propose a method to systematically generate adversarial examples close to the decision boundary by making insertion/deletion/substitution edits on the keyword's graphemes. We evaluate this technique on held-out data for a popular keyword and show that the technique improves AUC on a dataset of synthetic hard negatives by 61% while maintaining quality on positives and ambient negative audio data.

Submitted 24 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

Comments: Accepted at Interspeech 2025

arXiv:2505.13577 [pdf, other]

VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation

Authors: Yubin Kim, Taehan Kim, Wonjune Kang, Eugene Park, Joonsik Yoon, Dongjae Lee, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Cynthia Breazeal, Hae Won Park

Abstract: Vocal health plays a crucial role in people's lives, significantly impacting their communicative abilities and interactions. However, despite the global prevalence of voice disorders, many lack access to convenient diagnosis and treatment. This paper introduces VocalAgent, an audio large language model (LLM) to address these challenges through vocal health diagnosis. We leverage Qwen-Audio-Chat fine-tuned on three datasets collected in-situ from hospital patients, and present a multifaceted evaluation framework encompassing a safety assessment to mitigate diagnostic biases, cross-lingual performance analysis, and modality ablation studies. VocalAgent demonstrates superior accuracy on voice disorder classification compared to state-of-the-art baselines. Its LLM-based method offers a scalable solution for broader adoption of health diagnostics, while underscoring the importance of ethical and technical validation.

Submitted 26 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.12231 [pdf, ps, other]

Design of a 3-DOF Hopping Robot with an Optimized Gearbox: An Intermediate Platform Toward Bipedal Robots

Authors: JongHun Choe, Gijeong Kim, Hajun Kim, Dongyun Kang, Min-Su Kim, Hae-Won Park

Abstract: This paper presents a 3-DOF hopping robot with a human-like lower-limb joint configuration and a flat foot, capable of performing dynamic and repetitive jumping motions. To achieve both high torque output and a large hollow shaft diameter for efficient cable routing, a compact 3K compound planetary gearbox was designed using mixed-integer nonlinear programming for gear tooth optimization. To meet performance requirements within the constrained joint geometry, all major components, including the actuator, motor driver, and communication interface, were custom-designed. The robot weighs 12.45 kg, including a dummy mass, and measures 840 mm in length when the knee joint is fully extended. A reinforcement learning-based controller was employed, and the robot's performance was validated through hardware experiments, demonstrating stable and repetitive hopping motions in response to user inputs. These experimental results indicate that the platform serves as a solid foundation for future bipedal robot development.

Submitted 20 May, 2025; v1 submitted 18 May, 2025; originally announced May 2025.

arXiv:2505.12222 [pdf, other]

Learning Impact-Rich Rotational Maneuvers via Centroidal Velocity Rewards and Sim-to-Real Techniques: A One-Leg Hopper Flip Case Study

Authors: Dongyun Kang, Gijeong Kim, JongHun Choe, Hajun Kim, Hae-Won Park

Abstract: Dynamic rotational maneuvers, such as front flips, inherently involve large angular momentum generation and intense impact forces, presenting major challenges for reinforcement learning and sim-to-real transfer. In this work, we propose a general framework for learning and deploying impact-rich, rotation-intensive behaviors through centroidal velocity-based rewards and actuator-aware sim-to-real techniques. We identify that conventional link-level reward formulations fail to induce true whole-body rotation and introduce a centroidal angular velocity reward that accurately captures system-wide rotational dynamics. To bridge the sim-to-real gap under extreme conditions, we model motor operating regions (MOR) and apply transmission load regularization to ensure realistic torque commands and mechanical robustness. Using the one-leg hopper front flip as a representative case study, we demonstrate the first successful hardware realization of a full front flip. Our results highlight that incorporating centroidal dynamics and actuator constraints is critical for reliably executing highly dynamic motions. A supplementary video is available at: https://youtu.be/atMAVI4s1RY

Submitted 20 May, 2025; v1 submitted 17 May, 2025; originally announced May 2025.

arXiv:2505.12089 [pdf, ps, other]

NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets, Methods, and Results

Authors: Sangmin Lee, Eunpil Park, Angel Canelo, Hyunhee Park, Youngjo Kim, Hyung-Ju Chun, Xin Jin, Chongyi Li, Chun-Le Guo, Radu Timofte, Qi Wu, Tianheng Qiu, Yuchun Dong, Shenglin Ding, Guanghua Pan, Weiyu Zhou, Tao Hu, Yixu Feng, Duwei Dai, Yu Cao, Peng Wu, Wei Dong, Yanning Zhang, Qingsen Yan, Simon J. Larsen , et al. (11 additional authors not shown)

Abstract: This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effectively fusing these frames while adhering to strict efficiency constraints: fewer than 30 million model parameters and a computational budget under 4.0 trillion FLOPs. A total of 217 participants registered, with six teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 43.22 dB, showcasing the potential of novel methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers and practitioners in efficient burst HDR and restoration.

Submitted 17 May, 2025; originally announced May 2025.

arXiv:2505.09705 [pdf, other]

Search for a dark Higgs boson produced in association with inelastic dark matter at the Belle II experiment

Authors: Belle II Collaboration, I. Adachi, L. Aggarwal, H. Ahmed, H. Aihara, N. Akopov, S. Alghamdi, M. Alhakami, A. Aloisio, N. Althubiti, K. Amos, M. Angelsmark, N. Anh Ky, C. Antonioli, D. M. Asner, H. Atmacan, V. Aushev, M. Aversano, R. Ayad, V. Babu, N. K. Baghel, S. Bahinipati, P. Bambade, Sw. Banerjee, S. Bansal , et al. (415 additional authors not shown)

Abstract: Inelastic dark matter models that have two dark matter particles and a massive dark photon can reproduce the observed relic dark matter density without violating cosmological limits. The mass splitting between the two dark matter particles $χ_{1}$ and $χ_{2}$, with $m(χ_{2}) > m(χ_{1})$, is induced by a dark Higgs field and a corresponding dark Higgs boson $h^{\prime}$. We present a search for dark matter in events with two vertices, at least one of which must be displaced from the interaction region, and missing energy. Using a $365\,\mbox{fb}^{-1}$ data sample collected at Belle II, which operates at the SuperKEKB $e^+e^-$ collider, we observe no evidence for a signal. We set upper limits on the product of the production cross section $σ\left(e^+e^- \to h^\prime χ_1 χ_2\right)$, and the product of branching fractions $\mathcal{B}\left(χ_2\toχ_1 e^+ e^-\right)\times\mathcal{B}\left(h^\prime\to x^+x^-\right)$, where $x^+x^-$ indicates $μ^+μ^-, π^+π^-$, or $K^+K^-$, as functions of $h^{\prime}$ mass and lifetime at the level of $10^{-1}\,\mbox{fb}$. We set model-dependent upper limits on the dark Higgs mixing angle at the level of $10^{-5}$ and on the dark photon kinetic mixing parameter at the level of $10^{-3}$. This is the first search for dark Higgs bosons in association with inelastic dark matter.

Submitted 14 May, 2025; originally announced May 2025.

Comments: Submitted for publication with Physical Review Letters

Report number: Belle II Preprint 2025-015, KEK Preprint 2025-14

arXiv:2505.08418 [pdf, ps, other]

Search for lepton flavor-violating decay modes $B^0 \to K^{\ast 0}τ^\pm\ell^\mp$ ($\ell = e,μ$) with hadronic B-tagging at Belle and Belle II

Authors: Belle, Belle II Collaborations, I. Adachi, Y. Ahn, H. Aihara, N. Akopov, S. Alghamdi, M. Alhakami, A. Aloisio, K. Amos, M. Angelsmark, N. Anh Ky, C. Antonioli, D. M. Asner, H. Atmacan, V. Aushev, M. Aversano, R. Ayad, V. Babu, H. Bae, N. K. Baghel, S. Bahinipati, P. Bambade, Sw. Banerjee , et al. (353 additional authors not shown)

Abstract: We present the results of a search for the charged-lepton-flavor violating decays $B^0 \rightarrow K^{*0}τ^\pm \ell^{\mp}$, where $\ell^{\mp}$ is either an electron or a muon. The results are based on 365 fb$^{-1}$ and 711 fb$^{-1}$ datasets collected with the Belle II and Belle detectors, respectively. We use an exclusive hadronic $B$-tagging technique, and search for a signal decay in the system recoiling against a fully reconstructed $B$ meson. We find no evidence for $B^0 \rightarrow K^{*0}τ^\pm \ell^{\mp}$ decays and set upper limits on the branching fractions in the range of $(2.9-6.4)\times10^{-5}$ at 90% confidence level.

Submitted 13 May, 2025; originally announced May 2025.

Comments: 19 pages, 4 figures

Report number: Belle II preprint: 2025-014, KEK preprint: 2025-13

arXiv:2505.05710 [pdf, ps, other]

HyperspectralMAE: The Hyperspectral Imagery Classification Model using Fourier-Encoded Dual-Branch Masked Autoencoder

Authors: Wooyoung Jeong, Hyun Jae Park, Seonghun Jeong, Jong Wook Jang, Tae Hoon Lim, Dae Seoung Kim

Abstract: Hyperspectral imagery provides rich spectral detail but poses unique challenges because of its high dimensionality in both spatial and spectral domains. We propose \textit{HyperspectralMAE}, a Transformer-based foundation model for hyperspectral data that employs a \textit{dual masking} strategy: during pre-training we randomly occlude 50\% of spatial patches and 50\% of spectral bands. This forces the model to learn representations capable of reconstructing missing information across both dimensions. To encode spectral order, we introduce learnable harmonic Fourier positional embeddings based on wavelength. The reconstruction objective combines mean-squared error (MSE) with the spectral angle mapper (SAM) to balance pixel-level accuracy and spectral-shape fidelity. The resulting model contains about $1.8\times10^{8}$ parameters and produces 768-dimensional embeddings, giving it sufficient capacity for transfer learning. We pre-trained HyperspectralMAE on two large hyperspectral corpora -- NASA EO-1 Hyperion ($\sim$1\,600 scenes, $\sim$$3\times10^{11}$ pixel spectra) and DLR EnMAP Level-0 ($\sim$1\,300 scenes, $\sim$$3\times10^{11}$ pixel spectra) -- and fine-tuned it for land-cover classification on the Indian Pines benchmark. HyperspectralMAE achieves state-of-the-art transfer-learning accuracy on Indian Pines, confirming that masked dual-dimensional pre-training yields robust spectral-spatial representations. These results demonstrate that dual masking and wavelength-aware embeddings advance hyperspectral image reconstruction and downstream analysis.

Submitted 8 May, 2025; originally announced May 2025.

arXiv:2505.05068 [pdf]

Orbital-Selective Quasiparticle Depletion across the Density Wave Transition in Trilayer Nickelate La$_4$Ni$_3$O$_{10}$

Authors: Dong-Hyeon Gim, Chung Ha Park, Kee Hoon Kim

Abstract: We investigate the evolution of polarized electronic Raman response in trilayer nickelate La$_4$Ni$_3$O$_{10}$, uncovering a systematic reduction of the incoherent electron continuum across the density wave transition in the $A_{1g}$ and $B_{1g}$ representations. Analysis based on the Fermi surface band curvatures points to quasiparticle coherence in momentum positions with dominant $d_{x^2-y^2}$ orbital character. Our findings establish the symmetry channels and the active role of $d_{x^2-y^2}$ orbitals involved in the density wave formation, offering important insight into the electronic and magnetic correlations in the nickelate.

Submitted 8 May, 2025; originally announced May 2025.

Comments: (Main text) 12 pages, 4 figures. (Supplemental materials) 8 pages, 5 figures

arXiv:2505.04394 [pdf, other]

doi 10.1016/j.neucom.2025.130289

SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer

Authors: Young-Hu Park, Rae-Hong Park, Hyung-Min Park

Abstract: This paper presents an efficient visual speech encoder for lip reading. While most recent lip reading studies have been based on the ResNet architecture and have achieved significant success, they are not sufficiently suitable for efficiently capturing lip reading features due to high computational complexity in modeling spatio-temporal information. Additionally, using a complex visual model not only increases the complexity of lip reading models but also induces delays in the overall network for multi-modal studies (e.g., audio-visual speech recognition, speech enhancement, and speech separation). To overcome the limitations of Convolutional Neural Network (CNN)-based models, we apply the hierarchical structure and window self-attention of the Swin Transformer to lip reading. We configure a new lightweight scale of the Swin Transformer suitable for processing lip reading data and present the SwinLip visual speech encoder, which efficiently reduces computational load by integrating modified Convolution-augmented Transformer (Conformer) temporal embeddings with conventional spatial embeddings in the hierarchical structure. Through extensive experiments, we have validated that our SwinLip successfully improves the performance and inference speed of the lip reading network when applied to various backbones for word and sentence recognition, reducing computational load.
In particular, our SwinLip demonstrated robust performance in both English LRW and Mandarin LRW-1000 datasets and achieved state-of-the-art performance on the Mandarin LRW-1000 dataset with less computation compared to the existing state-of-the-art model.

Submitted 7 May, 2025; originally announced May 2025.

Journal ref: Neurocomputing, Volume 639, 28 July 2025, 130289

arXiv:2505.03777 [pdf, other]

MolMole: Molecule Mining from Scientific Literature

Authors: LG AI Research, Sehyun Chun, Jiye Kim, Ahra Jo, Yeonsik Jo, Seungyul Oh, Seungjun Lee, Kwangrok Ryoo, Jongmin Lee, Seung Hwan Kim, Byung Jun Kang, Soonyoung Lee, Jun Ha Park, Chanwoo Moon, Jiwon Ham, Haein Lee, Heejae Han, Jaeseung Byun, Soojong Do, Minju Ha, Dongyun Kim, Kyunghoon Bae, Woohyung Lim, Edward Hwayoung Lee, Yongmin Park , et al. (9 additional authors not shown)

Abstract: The extraction of molecular structures and reaction data from scientific documents is challenging due to their varied, unstructured chemical formats and complex document layouts. To address this, we introduce MolMole, a vision-based deep learning framework that unifies molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline for automating the extraction of chemical data directly from page-level documents. Recognizing the lack of a standard page-level benchmark and evaluation metric, we also present a testset of 550 pages annotated with molecule bounding boxes, reaction labels, and MOLfiles, along with a novel evaluation metric. Experimental results demonstrate that MolMole outperforms existing toolkits on both our benchmark and public datasets. The benchmark testset will be publicly available, and the MolMole toolkit will be accessible soon through an interactive demo on the LG AI Research website. For commercial inquiries, please contact us at \href{mailto:[email protected]}{contact\[email protected]}.

Submitted 7 May, 2025; v1 submitted 30 April, 2025; originally announced May 2025.

Comments: 15 pages, 12 figures

arXiv:2505.03306 [pdf]

Magnetic-field dependent VB- spin decoherence in hexagonal boron nitrides: A first-principles study

Authors: Jaewook Lee, Hyeonsu Kim, Huijin Park, Hosung Seo

Abstract: The negatively charged boron vacancy (VB-) in h-BN operates as an optically addressable spin qubit in two-dimensional materials. To further advance the spin into a versatile qubit platform, it is imperative to understand its spin decoherence precisely, which is currently one of the major limiting factors for the VB- spin. In this study, we employ a first-principles quantum many-body simulation to investigate the decoherence of the VB- spin in dense nuclear spin baths as a function of magnetic field from 100 G to 3 T, revealing several unique phenomena and their origin. We found that the decoherence mechanism changes at a specific magnetic field, which we refer to as the transition boundary (TB). Below the TB, the decoherence occurs within a submicrosecond and is primarily governed by independent nuclear spin dynamics. Above the TB, pair-wise flip-flop transitions become the dominant decoherence source, leading to a decoherence time of tens of microseconds. Building upon this understanding, we developed a method to predict the TB depending on the isotopic composition of h-BN, leading to TBs at 5020 G for h-10B14N and 2050 G for h-11B14N, which is in excellent agreement with our numerical results. We show that the larger TB in h-10BN derives from the larger nuclear spin of 10B than that of 11B, giving rise to strong nuclear modulation effects over a wider range of magnetic field in 10BN than in 11BN. We also explain the microscopic origin of several unique features in the decoherence, such as magnetic-field insensitive fast modulation found below the TB.
Our results provide essential insight on the role of the 100% dense nuclear spin environment with large nuclear spins in the VB- decoherence, opening a new avenue for advancing the spin qubit in h-BN as a robust platform in quantum information science.

Submitted 8 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

Comments: 23 pages, 6 figures

arXiv:2505.02912 [pdf, other]

Measurement of the time-integrated $CP$ asymmetry in $D^0\toπ^0π^0$ decays at Belle II

Authors: Belle II Collaboration, I. Adachi, Y. Ahn, N. Akopov, S. Alghamdi, M. Alhakami, A. Aloisio, N. Althubiti, K. Amos, M. Angelsmark, N. Anh Ky, C. Antonioli, D. M. Asner, H. Atmacan, T. Aushev, M. Aversano, R. Ayad, V. Babu, H. Bae, N. K. Baghel, S. Bahinipati, P. Bambade, Sw. Banerjee, M. Barrett, M. Bartl , et al. (350 additional authors not shown)

Abstract: We measure the time-integrated $CP$ asymmetry, $A_{CP}$, in $D^0\toπ^0π^0$ decays reconstructed in $e^+e^-\to c\bar{c}$ events collected by Belle II during 2019--2022. The data corresponds to an integrated luminosity of 428$\mathrm{fb}^{-1}$. The $D^0$ decays are required to originate from the flavor-conserving $D^{*+} \to D^0 π^+$ decay to determine the charm flavor at production time. Control samples of $D^0\to K^- π^+$ decays, with or without an associated pion from a $D^{*+}$ decay, are used to correct for detection asymmetries. The result, $A_{CP}(D^0\toπ^0π^0) = (0.30\pm 0.72\pm 0.20)\%$, where the first uncertainty is statistical and the second systematic, is consistent with $CP$ symmetry.

Submitted 5 May, 2025; originally announced May 2025.

Report number: Belle II Preprint 2025-009, KEK Preprint 2025-7

arXiv:2505.02908 [pdf, other]

Early Shock-Cooling Observations and Progenitor Constraints of Type IIb SN 2024uwq

Authors: Bhagya M. Subrayan, David J. Sand, K. Azalee Bostroem, Saurabh W. Jha, Aravind P. Ravi, Michaela Schwab, Jennifer E. Andrews, Griffin Hosseinzadeh, Stefano Valenti, Yize Dong, Jeniveve Pearson, Manisha Shrestha, Lindsey A. Kwok, Emily Hoang, Jeonghee Rho, Seong Hyun Park, Sung-Chul Yoon, T. R. Geball, Joshua Haislip, Daryl Janzen, Vladimir Kouprianov, Darshana Mehta, Nicolás Meza Retamal, Daniel E. Reichart, Moira Andrews , et al. (4 additional authors not shown)

Abstract: We present early multi-wavelength photometric and spectroscopic observations of the Type IIb supernova SN 2024uwq, capturing its shock-cooling emission phase and double-peaked light curve evolution. Early spectra reveal broad H-alpha (v ~ 15,500 km s$^{-1}$) and He I P-Cygni profiles of similar strengths. Over time the He I lines increase in strength while the H-alpha decreases, consistent with a hydrogen envelope ($M_{env}$ = 0.7 - 1.35 $M_\odot$ ) overlying helium-rich ejecta. Analytic modeling of early shock cooling emission and bolometric light analysis constrains the progenitor to a partially stripped star with radius R = 10 - 60 $R_\odot$, consistent with a blue/yellow supergiant with an initial ZAMS mass of 12 - 20 $M_\odot$ , likely stripped via binary interaction. SN 2024uwq occupies a transitional position between compact and extended Type IIb supernovae, highlighting the role of binary mass-transfer efficiency in shaping a continuum of stripped-envelope progenitors. Our results underscore the importance of both early UV/optical observations to characterize shock breakout signatures critical to map the diversity in evolutionary pathways of massive stars. Upcoming time domain surveys including Rubin Observatory's LSST and UV missions like ULTRASAT and UVEX will revolutionise our ability to systematically capture these early signatures, probing the full diversity of stripped progenitors and their explosive endpoints.

Submitted 5 May, 2025; originally announced May 2025.

Comments: 22 pages, 11 figures, Submitted to ApJL

arXiv:2505.01737 [pdf, other]

Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

Authors: Seong Hyeon Park, Jinwoo Shin

Abstract: In monocular videos that capture dynamic scenes, estimating the 3D geometry of video contents has been a fundamental challenge in computer vision. Specifically, the task is significantly challenged by the object motion, where existing models are limited to predict only partial attributes of the dynamic scenes, such as depth or pointmaps spanning only over a pair of frames. Since these attributes are inherently noisy under multiple frames, test-time global optimizations are often employed to fully recover the geometry, which is liable to failure and incurs heavy inference costs. To address the challenge, we present a new model, coined MMP, to estimate the geometry in a feed-forward manner, which produces a dynamic pointmap representation that evolves over multiple frames. Specifically, based on the recent Siamese architecture, we introduce a new trajectory encoding module to project point-wise dynamics on the representation for each frame, which can provide significantly improved expressiveness for dynamic scenes. In our experiments, we find MMP can achieve state-of-the-art quality in feed-forward pointmap prediction, e.g., 15.1% enhancement in the regression error.

Submitted 3 May, 2025; originally announced May 2025.

arXiv:2505.00735 [pdf, other]

Leveraging Depth Maps and Attention Mechanisms for Enhanced Image Inpainting

Authors: Jin Hyun Park, Harine Choi, Praewa Pitiphat

Abstract: Existing deep learning-based image inpainting methods typically rely on convolutional networks with RGB images to reconstruct images. However, relying exclusively on RGB images may neglect important depth information, which plays a critical role in understanding the spatial and structural context of a scene. Just as human vision leverages stereo cues to perceive depth, incorporating depth maps into the inpainting process can enhance the model's ability to reconstruct images with greater accuracy and contextual awareness. In this paper, we propose a novel approach that incorporates both RGB and depth images for enhanced image inpainting. Our models employ a dual encoder architecture, where one encoder processes the RGB image and the other handles the depth image. The encoded features from both encoders are then fused in the decoder using an attention mechanism, effectively integrating the RGB and depth representations. We use two different masking strategies, line and square, to test the robustness of the model under different types of occlusions. To further analyze the effectiveness of our approach, we use Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations to examine the regions of interest the model focuses on during inpainting. We show that incorporating depth information alongside the RGB image significantly improves the reconstruction quality.
Through both qualitative and quantitative comparisons, we demonstrate that the depth-integrated model outperforms the baseline, with attention mechanisms further enhancing inpainting performance, as evidenced by multiple evaluation metrics and visualization.

Submitted 8 May, 2025; v1 submitted 29 April, 2025; originally announced May 2025.

arXiv:2505.00260 [pdf, other]

Wideband covariance magnetometry below the diffraction limit

Authors: Xuan Hoang Le, Pavel E. Dolgirev, Piotr Put, Eric L. Peterson, Arjun Pillai, Alexander A. Zibrov, Eugene Demler, Hongkun Park, Mikhail D. Lukin

Abstract: We experimentally demonstrate a method for measuring correlations of wideband magnetic signals with spatial resolution below the optical diffraction limit. Our technique employs two nitrogen-vacancy (NV) centers in diamond as nanoscale magnetometers, spectrally resolved by inhomogeneous optical transitions. Using high-fidelity optical readout and long spin coherence time, we probe correlated MHz-range noise with sensitivity of 15 nT Hz$^{-1/4}$. In addition, we use this system for correlated $T_1$ relaxometry, enabling correlation measurements of GHz-range noise. Under such externally applied noise, while individual NV centers exhibit featureless relaxation, their correlation displays rich coherent and incoherent dynamics reminiscent of superradiance physics. This capability to probe high-frequency correlations provides a powerful tool for investigating a variety of condensed-matter phenomena characterized by nonlocal correlations.

Submitted 30 April, 2025; originally announced May 2025.

Comments: 20 pages, 14 figures

arXiv:2504.21784 [pdf, other]

A Comparison of the Consistent and Independent Second Moment Methods Applied to Thermal Radiative Transfer

Authors: Samuel Olivier, James S. Warsa, HyeongKae Park

Abstract: The design of efficient numerical methods for modeling thermal radiative transfer (TRT) is challenging due to the stiff, nonlinear coupling between radiation and material energies, especially at the time scales of interest in high energy density physics and astrophysics. Here, we investigate the use of the Second Moment Method (SMM) to accelerate absorption-emission within the context of the multigroup, Discrete Ordinates transport equations with discontinuous Galerkin spatial discretization. SMM employs a reduced-dimensional, diffusion-based model of radiation transport that, when coupled with suitable discrete closures, serves as a proxy for the transport equation, isolating the transport equation from the stiff absorption-emission physics. We use a gray low-order system to reduce the cost of solving the low-order system and leverage SMM low-order discretizations specifically designed to be scalably solvable with existing linear solver technology. Our algorithm robustly resolves the nonlinear TRT system while only relying on transport sweeps, linearly solving symmetric and positive definite, gray diffusion systems, and nonlinearly solving the spatially pointwise energy balance equation. This algorithm is used as a vehicle to compare the efficacy of low-order discretizations developed for steady-state, linear transport on gray and multigroup TRT problems in one and two spatial dimensions.

Submitted 30 April, 2025; originally announced April 2025.

arXiv:2504.21340 [pdf, other]

Towards Improved Cervical Cancer Screening: Vision Transformer-Based Classification and Interpretability

Authors: Khoa Tuan Nguyen, Ho-min Park, Gaeun Oh, Joris Vankerschaver, Wesley De Neve

Abstract: We propose a novel approach to cervical cell image classification for cervical cancer screening using the EVA-02 transformer model. We developed a four-step pipeline: fine-tuning EVA-02, feature extraction, selecting important features through multiple machine learning models, and training a new artificial neural network with optional loss weighting for improved generalization. With this design, our best model achieved an F1-score of 0.85227, outperforming the baseline EVA-02 model (0.84878). We also utilized Kernel SHAP analysis and identified key features correlating with cell morphology and staining characteristics, providing interpretable insights into the decision-making process of the fine-tuned model. Our code is available at https://github.com/Khoa-NT/isbi2025_ps3c.

Submitted 30 April, 2025; originally announced April 2025.

Comments: Accepted at ISBI 2025 "Challenge 2: Pap Smear Cell Classification Challenge"

arXiv:2504.20615 [pdf, other]

doi 10.1109/LRA.2025.3564711

Multi-Sensor Fusion for Quadruped Robot State Estimation using Invariant Filtering and Smoothing

Authors: Ylenia Nisticò, Hajun Kim, João Carlos Virgolino Soares, Geoff Fink, Hae-Won Park, Claudio Semini

Abstract: This letter introduces two multi-sensor state estimation frameworks for quadruped robots, built on the Invariant Extended Kalman Filter (InEKF) and Invariant Smoother (IS). The proposed methods, named E-InEKF and E-IS, fuse kinematics, IMU, LiDAR, and GPS data to mitigate position drift, particularly along the z-axis, a common issue in proprioceptive-based approaches. We derived observation models that satisfy group-affine properties to integrate LiDAR odometry and GPS into InEKF and IS. LiDAR odometry is incorporated using Iterative Closest Point (ICP) registration on a parallel thread, preserving the computational efficiency of proprioceptive-based state estimation. We evaluate E-InEKF and E-IS with and without exteroceptive sensors, benchmarking them against LiDAR-based odometry methods in indoor and outdoor experiments using the KAIST HOUND2 robot. Our methods achieve lower Relative Position Errors (RPE) and significantly reduce Absolute Trajectory Error (ATE), with improvements of up to 28% indoors and 40% outdoors compared to LIO-SAM and FAST-LIO2. Additionally, we compare E-InEKF and E-IS in terms of computational efficiency and accuracy.

Submitted 29 April, 2025; originally announced April 2025.

Comments: Accepted for publication in IEEE Robotics and Automation Letters

Showing 1–50 of 2,495 results for author: Park, H