Search | arXiv e-print repository

doi 10.1038/s41567-024-02586-x

Dark states of electrons in a quantum system with two pairs of sublattices

Authors: Yoonah Chung, Minsu Kim, Yeryn Kim, Seyeong Cha, Joon Woo Park, Jeehong Park, Yeonjin Yi, Dongjoon Song, Jung Hyun Ryu, Kimoon Lee, Timur K. Kim, Cephise Cacho, Jonathan Denlinger, Chris Jozwiak, Eli Rotenberg, Aaron Bostwick, Keun Su Kim

Abstract: A quantum state of matter that is forbidden to interact with photons and is therefore undetectable by spectroscopic means is called a dark state. This basic concept can be applied to condensed matter where it suggests that a whole band of quantum states could be undetectable across a full Brillouin zone. Here we report the discovery of such condensed matter dark states in palladium diselenide as a model system that has two pairs of sublattices in the primitive cell. By using angle-resolved photoemission spectroscopy, we find valence bands that are practically unobservable over the whole Brillouin zone at any photon energy, polarisation, and scattering plane. Our model shows that two pairs of sublattices located at half-translation positions and related by multiple glide-mirror symmetries make their relative quantum phases polarised into only four kinds, three of which become dark due to double destructive interference. This mechanism is generic to other systems with two pairs of sublattices, and we show how the phenomena observed in cuprates, lead-halide perovskites, and density wave systems can be resolved by the mechanism of dark states. Our results suggest that the sublattice degree of freedom, which has been overlooked so far, should be considered in the study of correlated phenomena and optoelectronic characteristics.

Submitted 10 July, 2025; originally announced July 2025.

Journal ref: Nature Physics 20, 1582-1588 (2024)

arXiv:2507.07476 [pdf, ps, other]

A comparative study of physics capabilities of a liquid argon and a water based liquid scintillator at DUNE

Authors: Nishat Fiza, Suhyeon Kim, Emar Masaku, Mehedi Masud, Hokyeong Nam, Juseong Park, Yujin Park, Kim Siyeon

Abstract: We present a comprehensive comparison of the physics sensitivities of a Liquid Argon Time Projection Chamber (LArTPC) and a Water-based Liquid Scintillator (WbLS) detector, considering their potential deployment as the fourth far detector module in the DUNE facility. Using GLoBES-based simulations, we evaluate their performance in measuring standard neutrino oscillation parameters ($θ_{23}, δ_{13}$ and $Δm^{2}_{31}$), both in the standard 3-neutrino case and in the presence of new physics scenarios involving light sterile neutrinos and neutral-current non-standard interactions (NC NSI). Our findings show that THEIA (a WbLS-based detector) significantly outperforms LArTPC in resolving the CP phase $δ_{13}$, especially near maximal CP violation, and in lifting the octant degeneracy of $θ_{23}$ due to its superior energy resolution and ability to clearly identify the second oscillation maximum. Furthermore, THEIA offers competitive reconstruction precision even with relatively moderate energy resolutions ($7-10\%/\sqrt{E}$) and demonstrates enhanced robustness under new physics scenarios. These results support the physics-driven case for a hybrid DUNE configuration utilizing both LArTPC and WbLS technologies for optimized sensitivity across the full spectrum of neutrino oscillation and physics beyond the standard model.

Submitted 10 July, 2025; originally announced July 2025.

Comments: 25 pages, 9 figures, 2 tables

arXiv:2507.07147 [pdf, ps, other]

Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation

Authors: Sua Lee, Kyubum Shin, Jung Ho Park

Abstract: Recent advances in pre-trained Vision Language Models (VLM) have shown promising potential for effectively adapting to downstream tasks through prompt learning, without the need for additional annotated paired datasets. To supplement the text information in VLM trained on correlations with vision data, new approaches leveraging Large Language Models (LLM) in prompts have been proposed, enhancing robustness to unseen and diverse data. Existing methods typically extract text-based responses (i.e., descriptions) from LLM to incorporate into prompts; however, this approach suffers from high variability and low reliability. In this work, we propose Description-free Multi-prompt Learning (DeMul), a novel method that eliminates the process of extracting descriptions and instead directly distills knowledge from LLM into prompts. By adopting a description-free approach, prompts can encapsulate richer semantics while still being represented as continuous vectors for optimization, thereby eliminating the need for discrete pre-defined templates. Additionally, in a multi-prompt setting, we empirically demonstrate the potential of prompt weighting in reflecting the importance of different prompts during training. Experimental results show that our approach achieves superior performance across 11 recognition datasets.

Submitted 9 July, 2025; originally announced July 2025.

Comments: Published as a conference paper at ICLR 2025

arXiv:2507.06785 [pdf, ps, other]

Bayesian Bootstrap-based Gaussian Copula Model for Mixed Data with High Missing Rates

Authors: Seongmin Kim, Jeunghun Oh, Hungkuk Ko, Jeongmin Park, Jaeyong Lee

Abstract: Missing data is a common issue in various fields such as medicine, social sciences, and natural sciences, and it poses significant challenges for accurate statistical analysis. Although numerous imputation methods have been proposed to address this issue, many of them fail to adequately capture the complex dependency structure among variables. To overcome this limitation, models based on the Gaussian copula framework have been introduced. However, most existing copula-based approaches do not account for the uncertainty in the marginal distributions, which can lead to biased marginal estimates and degraded performance, especially under high missingness rates. In this study, we propose a Bayesian bootstrap-based Gaussian Copula model (BBGC) that explicitly incorporates uncertainty in the marginal distributions of each variable. The proposed BBGC combines the flexible dependency modeling capability of the Gaussian copula with the Bayesian uncertainty quantification of marginal cumulative distribution functions (CDFs) via the Bayesian bootstrap. Furthermore, it is extended to handle mixed data types by incorporating methods for ordinal variable modeling. Through simulation studies and experiments on real-world datasets from the UCI repository, we demonstrate that the proposed BBGC outperforms existing imputation methods across various missing rates and mechanisms (MCAR, MAR). Additionally, the proposed model shows superior performance on real semiconductor manufacturing process data compared to conventional imputation approaches.

Submitted 9 July, 2025; originally announced July 2025.

Comments: 29 pages, 1 figure, 4 tables

arXiv:2507.06782 [pdf, ps, other]

Temporal Information Retrieval via Time-Specifier Model Merging

Authors: SeungYoon Han, Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, Huije Lee, Jong C. Park

Abstract: The rapid expansion of digital information and knowledge across structured and unstructured sources has heightened the importance of Information Retrieval (IR). While dense retrieval methods have substantially improved semantic matching for general queries, they consistently underperform on queries with explicit temporal constraints--often those containing numerical expressions and time specifiers such as "in 2015." Existing approaches to Temporal Information Retrieval (TIR) improve temporal reasoning but often suffer from catastrophic forgetting, leading to reduced performance on non-temporal queries. To address this, we propose Time-Specifier Model Merging (TSM), a novel method that enhances temporal retrieval while preserving accuracy on non-temporal queries. TSM trains specialized retrievers for individual time specifiers and merges them into a unified model, enabling precise handling of temporal constraints without compromising non-temporal retrieval. Extensive experiments on both temporal and non-temporal datasets demonstrate that TSM significantly improves performance on temporally constrained queries while maintaining strong results on non-temporal queries, consistently outperforming other baseline methods. Our code is available at https://github.com/seungyoonee/TSM .

Submitted 9 July, 2025; originally announced July 2025.

arXiv:2507.06754 [pdf, ps, other]

Counting isomorphism classes of elliptic curves over $\mathbb{F}_q(t)$

Authors: Jun-Yong Park

Abstract: We determine the precise number of isomorphism classes of elliptic curves over $\mathbb{F}_q(t)$ with $\text{char}(\mathbb{F}_q) = 3,2$. The key idea is to obtain the exact unweighted number of rational points on the classifying stacks $\mathcal{B} Q_{12}$, $\mathcal{B} Q_{24}$ and $\mathcal{B} Z$, where $Q_{12}$ and $Q_{24}$ denote the dicyclic groups of orders 12 and 24, respectively, and $Z$ denotes the non-reduced group scheme of order 2. This computation, inspired by the classical work of [de Jong] and performed via motivic height zeta functions of height moduli spaces constructed in [Bejleri-Park-Satriano], establishes a complete determination of the total number of isomorphism classes of rational points on $\overline{\mathcal{M}}_{1,1}$ over any rational function field $k(t)$ with perfect residue field $\text{char}(k) \ge 0$.

Submitted 9 July, 2025; originally announced July 2025.

Comments: 13 pages; Comments very welcome!

arXiv:2507.06543 [pdf, ps, other]

Token Bottleneck: One Token to Remember Dynamics

Authors: Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun

Abstract: Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with a few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments, demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.

Submitted 9 July, 2025; originally announced July 2025.

Comments: 17 pages, 9 figures, 8 tables, project page: https://token-bottleneck.github.io, code: https://github.com/naver-ai/tobo

arXiv:2507.06371 [pdf, ps, other]

doi 10.1038/s41586-024-08226-x

Terahertz field-induced metastable magnetization near criticality in FePS3

Authors: Batyr Ilyas, Tianchuang Luo, Alexander von Hoegen, Emil Viñas Boström, Zhuquan Zhang, Jaena Park, Junghyun Kim, Je-Geun Park, Keith A. Nelson, Angel Rubio, Nuh Gedik

Abstract: Controlling the functional properties of quantum materials with light has emerged as a frontier of condensed-matter physics, leading to the discovery of various light-induced phases of matter, such as superconductivity, ferroelectricity, magnetism and charge density waves. However, in most cases, the photoinduced phases return to equilibrium on ultrafast timescales after the light is turned off, limiting their practical applications. Here we use intense terahertz pulses to induce a metastable magnetization with a remarkably long lifetime of more than 2.5 milliseconds in the van der Waals antiferromagnet FePS3. The metastable state becomes increasingly robust as the temperature approaches the antiferromagnetic transition point, suggesting that critical order parameter fluctuations play an important part in facilitating the extended lifetime. By combining first-principles calculations with classical Monte Carlo and spin dynamics simulations, we find that the displacement of a specific phonon mode modulates the exchange couplings in a manner that favours a ground state with finite magnetization near the Néel temperature. This analysis also clarifies how the critical fluctuations of the dominant antiferromagnetic order can amplify both the magnitude and the lifetime of the new magnetic state. Our discovery demonstrates the efficient manipulation of the magnetic ground state in layered magnets through non-thermal pathways using terahertz light and establishes regions near critical points with enhanced order parameter fluctuations as promising areas to search for metastable hidden quantum states.

Submitted 8 July, 2025; originally announced July 2025.

Comments: 33 pages, 4 figures

Journal ref: Nature 636 (2024) 609-614

arXiv:2507.06261 [pdf, ps, other]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, et al. (3278 additional authors not shown)

Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

Submitted 7 July, 2025; originally announced July 2025.

Comments: 72 pages, 17 figures

arXiv:2507.06233 [pdf, ps, other]

Learning to Track Any Points from Human Motion

Authors: Inès Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Jiahui Huang, Joon-Young Lee, Seungryong Kim

Abstract: Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. Our proposed pipeline, AnthroTAP, addresses this by automatically generating pseudo-labeled training data, leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray-casting, and filter out unreliable tracks based on optical flow consistency. A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day on 4 GPUs, compared to 256 GPUs used in recent state-of-the-art.

Submitted 8 July, 2025; originally announced July 2025.

Comments: Project Page: https://cvlab-kaist.github.io/AnthroTAP/

arXiv:2507.06133 [pdf, ps, other]

Bridging Sequential Deep Operator Network and Video Diffusion: Residual Refinement of Spatio-Temporal PDE Solutions

Authors: Jaewan Park, Farid Ahmed, Kazuma Kobayashi, Seid Koric, Syed Bahauddin Alam, Iwona Jasiuk, Diab Abueidda

Abstract: Video-diffusion models have recently set the standard in video generation, inpainting, and domain translation thanks to their training stability and high perceptual fidelity. Building on these strengths, we repurpose conditional video diffusion as a physics surrogate for spatio-temporal fields governed by partial differential equations (PDEs). Our two-stage surrogate first applies a Sequential Deep Operator Network (S-DeepONet) to produce a coarse, physics-consistent prior from the prescribed boundary or loading conditions. The prior is then passed to a conditional video diffusion model that learns only the residual: the point-wise difference between the ground truth and the S-DeepONet prediction. By shifting the learning burden from the full solution to its much smaller residual space, diffusion can focus on sharpening high-frequency structures without sacrificing global coherence. The framework is assessed on two disparate benchmarks: (i) vortex-dominated lid-driven cavity flow and (ii) tensile plastic deformation of dogbone specimens. Across these data sets the hybrid surrogate consistently outperforms its single-stage counterpart, cutting the mean relative L2 error from 4.57% to 0.83% for the flow problem and from 4.42% to 2.94% for plasticity, relative improvements of 81.8% and 33.5%, respectively. The hybrid approach not only lowers quantitative errors but also improves visual quality, visibly recovering fine spatial details. These results show that (i) conditioning diffusion on a physics-aware prior enables faithful reconstruction of localized features, (ii) residual learning reduces the problem, accelerating convergence and enhancing accuracy, and (iii) the same architecture transfers seamlessly from incompressible flow to nonlinear elasto-plasticity without problem-specific architectural modifications, highlighting its broad applicability to nonlinear, time-dependent continua.

Submitted 8 July, 2025; originally announced July 2025.

arXiv:2507.06101 [pdf]

Reference compositions for bismuth telluride thermoelectric materials for low-temperature power generation

Authors: Nirma Kumari, Jaywan Chung, Seunghyun Oh, Jeongin Jang, Jongho Park, Ji Hui Son, SuDong Park, Byungki Ryu

Abstract: Thermoelectric (TE) technology enables direct heat-to-electricity conversion and is gaining attention as a clean, fuel-saving, and carbon-neutral solution for industrial, automotive, and marine applications. Despite nearly a century of research, apart from successes in deep-space power sources and solid-state cooling modules, the industrialization and commercialization of TE power generation remain limited. Since the new millennium, nanostructured bulk materials have accelerated the discovery of new TE systems. However, due to limited access to high-temperature heat sources, energy harvesting still relies almost exclusively on BiTe-based alloys, which are the only system operating stably near room temperature. Although many BiTe-based compositions have been proposed, concerns over reproducibility, reliability, and lifetime continue to hinder industrial adoption. Here, we aim to develop reference BiTe-based thermoelectric materials through data-driven analysis of Starrydata2, the world's largest thermoelectric database. We identify Bi0.46Sb1.54Te3 and Bi2Te2.7Se0.3 as the most frequently studied ternary compositions. These were synthesized using hot pressing and spark-plasma sintering. Thermoelectric properties were evaluated with respect to the processing method and measurement direction. The results align closely with the median of reported data, confirming the representativeness of the selected compositions. We propose these as reference BiTe materials, accompanied by transparent data and validated benchmarks. Their use can support the standardization of TE legs and modules while accelerating performance evaluation and industrial integration. We further estimated the performance of a thermoelectric module made from the reference composition, which gives a power output of over 2.51 W and an efficiency of 3.58% at a temperature difference of 120 K.

Submitted 9 July, 2025; v1 submitted 8 July, 2025; originally announced July 2025.

Comments: 45 pages, 4 tables, 14 figures (DOI info added for future activation upon publication. Error updated for k_ph)

arXiv:2507.06049 [pdf, ps, other]

FDR controlling procedures with dimension reduction and their application to GWAS with linkage disequilibrium score

Authors: Dayeon Jung, Yewon Kim, Junyong Park

Abstract: Genome-wide association studies (GWAS) have led to the discovery of numerous single nucleotide polymorphisms (SNPs) associated with various phenotypes and complex diseases. However, the identified genetic variants do not fully explain the heritability of complex traits, known as the missing heritability problem. To address this challenge and accurately control false positives while maximizing true associations, we propose two approaches involving linkage disequilibrium (LD) scores as covariates. We apply principal component analysis (PCA), one of the dimensionality reduction techniques, to control the False Discovery Rate (FDR) in the presence of high-dimensional covariates. This method not only provides a convenient interpretation of how multiple covariates in high dimensions affect the control of FDR but also offers higher statistical power compared to cases where covariates are not used. Furthermore, we aim to investigate how covariates contribute to increasing the statistical power through various simulation experiments, comparing the results with real data examples to derive better interpretations. Using real-world datasets, including GWAS with Body Mass Index (BMI) as the phenotype, we evaluate the performance of our proposed approaches. By incorporating LD scores as covariates in FDR-controlled GWAS analyses, we demonstrate their effectiveness in selecting informative LD scores and improving the identification of significant SNPs. Our methods alleviate computational burden and enhance interpretability while retaining essential information from LD scores. In general, our study contributes to the advancement of statistical methods in GWAS and provides practical guidance for researchers looking to improve the precision of genetic association analyses.

Submitted 8 July, 2025; originally announced July 2025.

arXiv:2507.05822 [pdf, ps, other]

Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models

Authors: Léa Dubois, Klaus Schmidt, Chengyu Wang, Ji-Hoon Park, Lin Wang, Santiago Munoz

Abstract: Current video understanding models excel at recognizing "what" is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core. Our key technical innovation is a sophisticated fusion module, inspired by the Q-Former architecture, which distills complex spatiotemporal and object-centric visual features into a concise, language-aligned representation. This enables the LLM to effectively ground its inferential processes in direct visual evidence. The model is trained via a two-stage strategy, beginning with large-scale alignment pre-training on video-text data, followed by targeted instruction fine-tuning on a curated dataset designed to elicit advanced reasoning and prediction skills. Extensive experiments demonstrate that our model achieves state-of-the-art performance on multiple challenging benchmarks. Notably, it exhibits remarkable zero-shot generalization to unseen reasoning tasks, and our in-depth ablation studies validate the critical contribution of each architectural component. This work pushes the boundary of machine perception from simple recognition towards genuine cognitive understanding, paving the way for more intelligent and capable AI systems in robotics, human-computer interaction, and beyond.

Submitted 8 July, 2025; originally announced July 2025.

Comments: 22 pages, 4 figures

MSC Class: CS

ACM Class: I.2.10

arXiv:2507.05673 [pdf, ps, other]

R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding

Authors: Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar

Abstract: Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.

Submitted 8 July, 2025; originally announced July 2025.

Comments: ACL 2025; 17 pages
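
The abstract above contrasts cross-entropy loss with the Intersection-over-Union (IoU) metric from object detection. As a minimal sketch of the metric only (not the paper's IoU-aware objective, which the abstract does not specify), IoU for two axis-aligned boxes can be computed as follows. The `(x1, y1, x2, y2)` box format and the function name `iou` are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2; this
    coordinate convention is an assumption of the sketch.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero
    # when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A predicted box that matches the target exactly scores 1.0 and a disjoint box scores 0.0, which is the sense in which IoU grades localization quality rather than token-level agreement.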

arXiv:2507.05585 [pdf, ps, other]

Capacity of the range of random walk: Moderate deviations in dimensions 4 and 5

Authors: Arka Adhikari, Jiyun Park

Abstract: We prove a moderate deviation principle for the capacity of the range of random walk in $\mathbb{Z}^5$. Depending on the scale of deviation, we get two different regimes. We observe Gaussian tails when the deviation scale is smaller than $n^{1/2} (\log n)^{3/4}$. Otherwise, we get non-Gaussian tails with a constant arising from a generalized Gagliardo-Nirenberg inequality. This is analogous to the behavior of the volume of the random walk range in $\mathbb{Z}^3$. Our methods can also be applied to the $d = 4$ case to prove the moderate deviation principle in almost the full range of interest. This extends the work of Okada and the first author \cite{AdhikariOkada2023}, where they showed moderate deviations up to a deviation scale of $\log \log n$ times the standard deviation.

Submitted 7 July, 2025; originally announced July 2025.

Comments: 33 pages

MSC Class: 60F10; 60G50

arXiv:2507.05094 [pdf, ps, other]

Observation of the decays $B^{+} \to Σ_{c}(2455)^{++} \overline{Ξ}_{c}^{-}$ and $B^{0} \to Σ_{c}(2455)^{0} \overline{Ξ}_{c}^{0}$

Authors: Belle and Belle II Collaborations: M. Abumusabh, I. Adachi, L. Aggarwal, H. Ahmed, Y. Ahn, H. Aihara, N. Akopov, S. Alghamdi, M. Alhakami, A. Aloisio, N. Althubiti, K. Amos, N. Anh Ky, D. M. Asner, H. Atmacan, T. Aushev, V. Aushev, R. Ayad, V. Babu, H. Bae, N. K. Baghel, S. Bahinipati, et al. (364 additional authors not shown)

Abstract: We report the first observation of the two-body baryonic decays $B^{+} \to Σ_{c}(2455)^{++} \overline{Ξ}_{c}^{-}$ and $B^{0} \to Σ_{c}(2455)^{0} \overline{Ξ}_{c}^{0}$ with significances of $7.3\,σ$ and $6.2\,σ$, respectively, including statistical and systematic uncertainties. The branching fractions are measured to be $\mathcal{B}(B^{+} \to Σ_{c}(2455)^{++} \overline{Ξ}_{c}^{-}) = (5.74 \pm 1.11 \pm 0.42_{-1.53}^{+2.47}) \times 10^{-4}$ and $\mathcal{B}(B^{0} \to Σ_{c}(2455)^{0} \overline{Ξ}_{c}^{0}) = (4.83 \pm 1.12 \pm 0.37_{-0.60}^{+0.72}) \times 10^{-4}$. The first and second uncertainties are statistical and systematic, respectively, while the third ones arise from the absolute branching fractions of $\overline{Ξ}_{c}^{-}$ or $\overline{Ξ}_{c}^{0}$ decays. The data samples used for this analysis have integrated luminosities of 711~$\mathrm{fb}^{-1}$ and 365~$\mathrm{fb}^{-1}$, and were collected at the $Υ(4S)$ resonance by the Belle and Belle~II detectors operating at the KEKB and SuperKEKB asymmetric-energy $e^{+}e^{-}$ colliders, respectively.

Submitted 7 July, 2025; originally announced July 2025.

Report number: Belle II Preprint 2025-019, KEK Preprint 2025-18

arXiv:2507.05050 [pdf, ps, other]

Measurement of the $ D^{0}\rightarrow K^{-}π^{+}e^{+}e^{-} $ branching fraction and search for $ D^{0}\rightarrow π^{+}π^{-}e^{+}e^{-} $ and $D^{0}\rightarrow K^{+}K^{-}e^{+}e^{-} $ decays at Belle

Authors: Belle and Belle II Collaborations: I. Adachi, L. Aggarwal, H. Ahmed, Y. Ahn, H. Aihara, N. Akopov, S. Alghamdi, M. Alhakami, A. Aloisio, N. Althubiti, K. Amos, M. Angelsmark, N. Anh Ky, C. Antonioli, D. M. Asner, H. Atmacan, T. Aushev, V. Aushev, M. Aversano, R. Ayad, V. Babu, H. Bae, et al. (458 additional authors not shown)

Abstract: We present a study of the rare charm meson decays $ D^{0}\rightarrow K^{+}K^{-}e^{+}e^{-} $, $ π^{+}π^{-}e^{+}e^{-} $, and $ K^{-}π^{+}e^{+}e^{-} $ using a 942 fb$^{-1}$ data set collected by the Belle detector at the KEKB asymmetric-energy $ e^{+}e^{-} $ collider. We use $ D^{0} $ candidates identified by the charge of the pion in $ D^{*} \rightarrow D^{0} π$ decays and normalize the branching fractions to $ D^{0} \rightarrow K^{-}π^{+}π^{-}π^{+} $ decays. The branching fraction for decay $ D^{0} \rightarrow K^{-}π^{+}e^{+}e^{-} $ is measured to be (39.6 $\pm$ 4.5 (stat) $\pm$ 2.9 (syst)) $\times$ $10^{-7}$, with the dielectron mass in the $ ρ/ω$ mass region $ 675 < m_{ee} < 875 $ MeV$/c^{2}$. We also search for $ D^{0}\rightarrow h^{-} h^{(\prime)+}e^{+}e^{-} $ ($ h^{(\prime)}=K,\,π$) decays with the dielectron mass near the $η$ and $φ$ resonances, and away from these resonances for the $ K^{+}K^{-}e^{+}e^{-} $ and $ π^{+}π^{-}e^{+}e^{-} $ modes. For these modes, we find no significant signals and set 90$\%$ confidence level upper limits on their branching fractions at the $\mathcal{O}$(10$^{-7}$) level.

Submitted 7 July, 2025; originally announced July 2025.

Report number: Belle II Preprint 2025-020; KEK Preprint 2025-19

arXiv:2507.04896 [pdf, ps, other]

Cross sections of $η$ mesons in $p$$+$$p$ collisions at forward rapidity at $\sqrt{s}=500$ GeV and central rapidity at $\sqrt{s}=510$ GeV

Authors: PHENIX Collaboration, N. J. Abdulameer, U. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, R. Akimoto, H. Al-Ta'ani, J. Alexander, M. Alfred, D. Anderson, K. R. Andrews, A. Angerami, S. Antsupov, K. Aoki, N. Apadula, E. Appelt, Y. Aramaki, R. Armendariz, H. Asano, E. C. Aschenauer, E. T. Atomssa, T. C. Awes, B. Azmoun, et al. (476 additional authors not shown)

Abstract: We present the first measurements of the forward and midrapidity $η$-meson cross sections from $p$$+$$p$ collisions at $\sqrt{s}=500$ and $510$~GeV, respectively. We also report the midrapidity $η/π^0$ ratio at 510 GeV. The forward cross section is measured differentially in $η$-meson transverse momentum ($p_T$) from 1.0 to 6.5~GeV/$c$ for pseudorapidity $3.0<|η|<3.8$. The midrapidity cross section is measured from 3.5 to 44 GeV/$c$ for pseudorapidity $|η|<0.35$. Both cross sections serve as critical inputs to an updated global analysis of the $η$-meson fragmentation functions.

Submitted 7 July, 2025; originally announced July 2025.

Comments: 500 authors from 81 institutions, 14 pages, 7 figures, 3 tables. v1 is version submitted to Physical Review D. HEPdata tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

arXiv:2507.04482 [pdf, ps, other]

A Training-Free Style-Personalization via Scale-wise Autoregressive Model

Authors: Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im

Abstract: We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design--content, style, and generation--each guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central contribution of this work is a step-wise and attention-wise intervention analysis. Through systematic prompt and feature injection, we find that early-to-middle generation steps play a pivotal role in shaping both content and style, and that query features predominantly encode content-specific information. Guided by these insights, we introduce two targeted mechanisms: Key Stage Attention Sharing, which aligns content and style during the semantically critical steps, and Adaptive Query Sharing, which reinforces content semantics in later steps through similarity-aware query blending. Extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.

Submitted 6 July, 2025; originally announced July 2025.

Comments: 13 pages, 10 figures

arXiv:2507.04463 [pdf, ps, other]

Low-mass vector-meson production at forward rapidity in $p$$+$$p$ and Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$~GeV

Authors: PHENIX Collaboration, N. J. Abdulameer, U. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, M. Alfred, D. Anderson, V. Andrieux, S. Antsupov, N. Apadula, H. Asano, B. Azmoun, V. Babintsev, M. Bai, N. S. Bandara, B. Bannier, E. Bannikov, K. N. Barish, S. Bathe, A. Bazilevsky, M. Beaumier, S. Beckman, R. Belmont, et al. (331 additional authors not shown)

Abstract: The PHENIX experiment at the Relativistic Heavy Ion Collider has measured low-mass vector-meson ($ω+ρ$ and $φ$) production through the dimuon decay channel at forward rapidity $(1.2<|\mbox{y}|<2.2)$ in $p$$+$$p$ and Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$~GeV. The low-mass vector-meson yield and nuclear-modification factor were measured as a function of the average number of participating nucleons, $\langle N_{\rm part}\rangle$, and the transverse momentum $p_T$. These results were compared with those obtained via the kaon decay channel in a similar $p_T$ range at midrapidity. The nuclear-modification factors in both rapidity regions are consistent within the uncertainties. A comparison of the $ω+ρ$ and $J/ψ$ mesons reveals that the light and heavy flavors are consistently suppressed across both $p_T$ and ${\langle}N_{\rm part}\rangle$. In contrast, the $φ$ meson displays a nuclear-modification factor consistent with unity, suggesting strangeness enhancement in the medium formed.

Submitted 6 July, 2025; originally announced July 2025.

Comments: 356 authors from 71 institutions, 14 pages, 14 figures, 1 table. v1 is version submitted to Physical Review C. HEPdata tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

arXiv:2507.04157 [pdf, ps, other]

Hyperspectral Dual-Comb Compressive Imaging for Minimally-Invasive Video-Rate Endomicroscopy

Authors: Myoung-Gyun Suh, David Dang, Maodong Gao, Yucheng Jin, Byoung Jun Park, Beyonce Hu, Wilton J. M. Kort-Kamp, Ho Wai, Lee

Abstract: Endoscopic imaging is essential for real-time visualization of internal organs, yet conventional systems remain bulky, complex, and expensive due to their reliance on large, multi-element optical components. This limits their accessibility to delicate or constrained anatomical regions. Achieving real-time, high-resolution endomicroscopy using compact, low-cost hardware at the hundred-micron scale remains an unsolved challenge. Optical fibers offer a promising route toward miniaturization by providing sub-millimeter-scale imaging channels; however, existing fiber-based methods typically rely on raster scanning or multicore bundles, which limit the resolution and imaging speed. In this work, we overcome these limitations by integrating dual-comb interferometry with compressive ghost imaging and advanced computational reconstruction. Our technique, hyperspectral dual-comb compressive imaging, utilizes optical frequency combs to generate wavelength-multiplexed speckle patterns that are delivered through a single-core fiber and detected by a single-pixel photodetector. This parallel speckle illumination and detection enable snapshot compression and acquisition of image information using zero-dimensional hardware, completely eliminating the need for both spatial and spectral scanning. To decode these highly compressed signals, we develop a transformer-based deep learning model capable of rapid, high-fidelity image reconstruction at extremely low sampling ratios.
This approach significantly outperforms classical ghost imaging methods in both speed and accuracy, achieving video-rate imaging with a dramatically simplified optical front-end. Our results represent a major advance toward minimally invasive, cost-effective endomicroscopy and provide a generalizable platform for optical sensing in applications where hardware constraints are critical.

Submitted 5 July, 2025; originally announced July 2025.
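
The abstract above compares its transformer-based reconstruction against classical ghost imaging. The sketch below illustrates only that classical baseline idea, under simplifying assumptions that are not taken from the paper: uniform random patterns stand in for speckle, the object is a flattened list of pixel values, and the single-pixel detector is noiseless. The function name `ghost_image` and all parameters are hypothetical.

```python
import random


def ghost_image(obj, num_patterns=4000, seed=0):
    """Correlation-based ghost imaging of a flattened object.

    For each random pattern P_k, a single-pixel "bucket" value
    b_k = sum_i P_k[i] * obj[i] is recorded. The image estimate is
    the covariance <b * P_i> - <b> * <P_i> for every pixel i.
    """
    rng = random.Random(seed)
    n = len(obj)
    sum_b = 0.0
    sum_p = [0.0] * n
    sum_bp = [0.0] * n
    for _ in range(num_patterns):
        pattern = [rng.random() for _ in range(n)]
        bucket = sum(p * o for p, o in zip(pattern, obj))
        sum_b += bucket
        for i, p in enumerate(pattern):
            sum_p[i] += p
            sum_bp[i] += bucket * p
    mean_b = sum_b / num_patterns
    return [
        sum_bp[i] / num_patterns - mean_b * sum_p[i] / num_patterns
        for i in range(n)
    ]


# A 4x4 binary test object, flattened row by row.
obj = [0.0, 0.0, 1.0, 0.0,
       0.0, 1.0, 1.0, 0.0,
       0.0, 0.0, 1.0, 0.0,
       0.0, 0.0, 0.0, 0.0]
recon = ghost_image(obj)
```

Here thousands of patterns are used to recover 16 pixels, which illustrates why correlation-based reconstruction is slow at low sampling ratios and why a learned decoder is attractive in that regime.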

arXiv:2507.04018 [pdf, ps, other]

doi 10.1007/978-981-96-8180-8_38

Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning

Authors: Nayeon Kim, Eojin Jeon, Jun-Hyung Park, SangKeun Lee

Abstract: In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, facilitating Korean OOV word representations to capture both text and phoneme information of words. We empirically demonstrate that KOPL significantly improves the performance on Korean Natural Language Processing (NLP) tasks, while being readily integrated into existing static and contextual Korean embedding models in a plug-and-play manner. Notably, we show that KOPL outperforms the state-of-the-art model by an average of 1.9%. Our code is available at https://github.com/jej127/KOPL.git.

Submitted 5 July, 2025; originally announced July 2025.

Journal ref: Advances in Knowledge Discovery and Data Mining. PAKDD 2025

arXiv:2507.03660 [pdf, ps, other]

When Network Architecture Meets Physics: Deep Operator Learning for Coupled Multiphysics

Authors: Kazuma Kobayashi, Jaewan Park, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam

Abstract: Scientific applications increasingly demand real-time surrogate models that can capture the behavior of strongly coupled multiphysics systems driven by multiple input functions, such as in thermo-mechanical and electro-thermal processes. While neural operator frameworks, such as Deep Operator Networks (DeepONets), have shown considerable success in single-physics settings, their extension to multiphysics problems remains poorly understood. In particular, the challenge of learning nonlinear interactions between tightly coupled physical fields has received little systematic attention. This study addresses a foundational question: should the architectural design of a neural operator reflect the strength of physical coupling it aims to model? To answer this, we present the first comprehensive, architecture-aware evaluation of DeepONet variants across three regimes: single-physics, weakly coupled, and strongly coupled multiphysics systems. We consider a reaction-diffusion equation with dual spatial inputs, a nonlinear thermo-electrical problem with bidirectional coupling through temperature-dependent conductivity, and a viscoplastic thermo-mechanical model of steel solidification governed by transient phase-driven interactions. Two operator-learning frameworks, the classical DeepONet and its sequential GRU-based extension, S-DeepONet, are benchmarked using both single-branch and multi-branch (MIONet-style) architectures.
Our results demonstrate that architectural alignment with physical coupling is crucial: single-branch networks significantly outperform multi-branch counterparts in strongly coupled settings, whereas multi-branch encodings offer advantages for decoupled or single-physics problems. Once trained, these surrogates achieve full-field predictions up to 1.8e4 times faster than high-fidelity finite-element solvers, without compromising solution accuracy.

Submitted 4 July, 2025; originally announced July 2025.

arXiv:2507.03603 [pdf, ps, other]

Selection bias effects on high-$p_\mathrm{T}$ yield and correlation measurements in Oxygen+Oxygen collisions

Authors: JaeBeom Park, J. L. Nagle, Dennis V. Perepelitsa, Sanghoon Lim, Constantin Loizides

Abstract: Oxygen+Oxygen (O+O) collisions at RHIC and the LHC offer a unique experimental opportunity to observe the onset of jet quenching in intermediate relativistic collision systems. As with the smaller proton-nucleus or larger nucleus-nucleus systems, measurements of centrality-selected high-$p_\mathrm{T}$ processes in O+O collisions are expected to be sensitive to selection bias effects, which will be necessary to quantify or mitigate before a definitive conclusion on the presence of jet quenching. Using two Monte Carlo heavy-ion event generators, we provide a survey of centrality bias effects on high-$p_\mathrm{T}$ yield and correlation measurements. Some highlights of our findings include that (1) bias factors for the accessible kinematic range at RHIC show a non-trivial $p_\mathrm{T}$ dependence, compared to a negligible one at the LHC given the smaller accessible Bjorken-$x$ range, (2) centrality definitions based on multiplicity are less sensitive to bias effects than those based on the transverse energy, (3) the Angantyr generator gives qualitatively similar but larger-magnitude bias factors than HIJING, and (4) correlation measurements have a much smaller sensitivity to bias effects than do yield measurements. The findings here are intended to guide the experimental design and interpretation of O+O jet quenching and other hard-process measurements.

Submitted 4 July, 2025; originally announced July 2025.

Comments: 9 pages, 12 figures, comments welcome before journal submission

arXiv:2507.03192 [pdf, ps, other]

Parallel multilevel methods for solving the Darcy--Forchheimer model based on a nearly semicoercive formulation

Authors: Jongho Park, S. Majid Hassanizadeh

Abstract: High-velocity fluid flow through porous media is modeled by prescribing a nonlinear relationship between the flow rate and the pressure gradient, called the Darcy--Forchheimer equation. This paper is concerned with the analysis of parallel multilevel methods for solving the Darcy--Forchheimer model. We begin by reformulating the Darcy--Forchheimer model as a nearly semicoercive convex optimization problem via the augmented Lagrangian method. Building on this formulation, we develop a parallel multilevel method within the framework of subspace correction for nearly semicoercive convex problems. The proposed method exhibits robustness with respect to both the nearly semicoercive nature of the problem and the size of the discretized system. To further enhance convergence, we incorporate a backtracking line search scheme. Numerical results validate the theoretical findings and demonstrate the effectiveness and superiority of the proposed approach.

Submitted 3 July, 2025; originally announced July 2025.

Comments: 21 pages, 3 figures

MSC Class: 65N55; 65N20; 76S05; 90C25
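
The abstract above mentions a backtracking line search used to enhance convergence. The abstract does not describe that scheme, so the following is a generic Armijo backtracking sketch for a smooth function, offered only to illustrate the general idea. The function name, the parameters `shrink` and `c`, and the quadratic test problem are assumptions of this example.

```python
def backtracking_line_search(f, grad, x, direction,
                             step=1.0, shrink=0.5, c=1e-4, max_iter=50):
    """Shrink the step until the Armijo sufficient-decrease test holds.

    The test is f(x + t*d) <= f(x) + c * t * <grad f(x), d>,
    where d is assumed to be a descent direction.
    """
    fx = f(x)
    slope = sum(g * d for g, d in zip(grad(x), direction))
    for _ in range(max_iter):
        candidate = [xi + step * di for xi, di in zip(x, direction)]
        if f(candidate) <= fx + c * step * slope:
            return step
        step *= shrink
    return step  # Falls back to the last (smallest) trial step.


# Illustrative problem: f(x) = x_1^2 + x_2^2, steepest-descent direction.
def f(x):
    return sum(xi * xi for xi in x)


def grad(x):
    return [2.0 * xi for xi in x]


x0 = [1.0, 1.0]
d0 = [-g for g in grad(x0)]
t = backtracking_line_search(f, grad, x0, d0)
```

On this quadratic the full step overshoots to the mirror point and gives no decrease, so the search halves the step once and lands on the minimizer.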

arXiv:2507.03114 [pdf, ps, other]

Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications

Authors: Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan

Abstract: This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of models, distributing them across multiple devices is required. Overlapping strategies, which enable concurrent computation and communication, are critical for mitigating communication bottlenecks and maximizing GPU utilization. However, the current consensus is that we should always and aggressively overlap compute and communication to mitigate the overhead of distribution. By systematically evaluating state-of-the-art GPUs, this study investigates the impact of hardware features such as numeric precision, specialized cores, and power capping on distributed training workloads. Comprehensive experiments and studies showcase the effects of overlapping strategies on performance and power consumption across varying scenarios. We observe that overlapping computation and communication can result in an average computational slowdown of 18.9%, with a maximum of 40.0% slowdown. This slowdown is in comparison to the scenario when no communication was happening with the compute. We consider this an ideal execution scenario, where the communication in parallel has no impact on the compute time. However, performing computation and communication sequentially is, on average, 10.2% slower than overlapped execution, with a maximum slowdown of 26.6%.
We further observe that, while specialized datapath and optimized numeric precision mitigate certain slowdowns, overlapping execution can lead to resource contention and also increase power consumption under specific configurations. The analysis also uncovers trade-offs introduced by power and frequency capping, emphasizing the importance of balanced strategies to optimize energy efficiency and training throughput.

Submitted 3 July, 2025; originally announced July 2025.

arXiv:2507.01588 [pdf, ps, other]

doi 10.1007/978-3-031-78125-4_19

Enhancing Multi-Exposure High Dynamic Range Imaging with Overlapped Codebook for Improved Representation Learning

Authors: Keuntek Lee, Jaehyun Park, Nam Ik Cho

Abstract: High dynamic range (HDR) imaging technique aims to create realistic HDR images from low dynamic range (LDR) inputs. Specifically, multi-exposure HDR imaging uses multiple LDR frames taken from the same scene to improve reconstruction performance. However, there are often discrepancies in motion among the frames, and different exposure settings for each capture can lead to saturated regions. In this work, we first propose an Overlapped codebook (OLC) scheme, which can improve the capability of the VQGAN framework for learning implicit HDR representations by modeling the common exposure bracket process in the shared codebook structure. Further, we develop a new HDR network that utilizes HDR representations obtained from a pre-trained VQ network and OLC. This allows us to compensate for saturated regions and enhance overall visual quality. We have tested our approach extensively on various datasets and have demonstrated that it outperforms previous methods both qualitatively and quantitatively.

Submitted 2 July, 2025; originally announced July 2025.

Comments: Accepted to International Conference on Pattern Recognition. Springer, Cham, 2025 (ICPR 2024)

arXiv:2507.01496 [pdf, ps, other]

ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation

Authors: Jimyeong Kim, Jungwon Park, Yeji Song, Nojun Kwak, Wonjong Rhee

Abstract: Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.

Submitted 2 July, 2025; originally announced July 2025.

Comments: Published at ICCV 2025. Project page: https://wlaud1001.github.io/ReFlex/

arXiv:2507.01415 [pdf, ps, other]

Randomized subspace correction methods for convex optimization

Authors: Boou Jiang, Jongho Park, Jinchao Xu

Abstract: This paper introduces an abstract framework for randomized subspace correction methods for convex optimization, which unifies and generalizes a broad class of existing algorithms, including domain decomposition, multigrid, and block coordinate descent methods. We provide a convergence rate analysis ranging from minimal assumptions to more practical settings, such as sharpness and strong convexity. While most existing studies on block coordinate descent methods focus on nonoverlapping decompositions and smooth or strongly convex problems, our framework extends to more general settings involving arbitrary space decompositions, inexact local solvers, and problems with limited smoothness or convexity. The proposed framework is broadly applicable to convex optimization problems arising in areas such as nonlinear partial differential equations, imaging, and data science.

Submitted 2 July, 2025; originally announced July 2025.

Comments: 21 pages, 0 figures

MSC Class: 90C25; 65N55; 65J05; 90C06
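
The abstract above names block coordinate descent as one special case of randomized subspace correction. As a minimal sketch of that special case only (not the paper's abstract framework), the example below applies randomized coordinate descent with exact one-dimensional solves to a strongly convex quadratic $f(x) = \tfrac12 x^\top A x - b^\top x$. The function name, the 2x2 matrix, and the iteration count are assumptions of this illustration.

```python
import random


def randomized_coordinate_descent(A, b, iters=2000, seed=0):
    """Minimize 0.5 * x^T A x - b^T x for symmetric positive definite A.

    Each iteration picks one coordinate uniformly at random (a
    one-dimensional subspace) and minimizes f exactly along it,
    which amounts to x_i = (b_i - sum_{j != i} A_ij x_j) / A_ii.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        i = rng.randrange(n)
        residual = b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = residual / A[i][i]
    return x


A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]
x_star = randomized_coordinate_descent(A, b)
# The exact minimizer solves A x = b, i.e. x = (1/11, 7/11).
```

Each coordinate update is an exact local solve; the framework in the abstract additionally covers inexact local solvers and overlapping decompositions, which this sketch does not attempt.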

arXiv:2507.01249 [pdf, ps, other]

Search for an Axion-Like Particle in $B\rightarrow K^{(*)} a (\rightarrowγγ)$ Decays at Belle

Authors: Belle and Belle II Collaborations: I. Adachi, L. Aggarwal, H. Ahmed, Y. Ahn, H. Aihara, N. Akopov, S. Alghamdi, M. Alhakami, A. Aloisio, N. Althubiti, K. Amos, M. Angelsmark, N. Anh Ky, C. Antonioli, D. M. Asner, H. Atmacan, T. Aushev, V. Aushev, M. Aversano, R. Ayad, V. Babu, H. Bae, et al. (400 additional authors not shown)

Abstract: We report a search for an axion-like particle $a$ in $B\rightarrow K^{(*)} a (\rightarrowγγ)$ decays using data collected with the Belle detector at the KEKB asymmetric energy electron-positron collider. The search is based on a $711 \mathrm{fb^{-1}}$ data sample collected at the $Υ(4S)$ resonance energy, corresponding to a sample of $772\times10^6$ $Υ(4S)$ events. In this study, we search for the decay of the axion-like particle into a pair of photons, $a \rightarrow γγ$. We scan the two-photon invariant mass in the range $0.16\ \mathrm{GeV/}c^2-4.50\ \mathrm{GeV}/c^2$ for the $K$ modes and $0.16\ \mathrm{GeV/}c^2-4.20\ \mathrm{GeV}/c^2$ for the $K^{*}$ modes. No significant signal is observed in any of the modes, and 90\% confidence level upper limits are established on the coupling to the $W$ boson, $g_{aW}$, as a function of $a$ mass. The limits range from $3 \times 10^{-6} \mathrm{GeV}^{-1}$ to $3 \times 10^{-5} \mathrm{GeV}^{-1}$, improving the current constraints on $g_{aW}$ by a factor of two over the most stringent previous experimental results.

Submitted 3 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

Comments: 26 pages, 15 Figures

Report number: Belle II Preprint: 2025-017 KEK Preprint: 2025-16

arXiv:2507.00726 [pdf, ps, other]

Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess

Authors: Dongyoon Hwang, Hojoon Lee, Jaegul Choo, Dongmin Park, Jongho Park

Abstract: While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide dense reward on the LLM's output move quality, which can be seen as a form of knowledge distillation. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess, a deficit which RL alone may not be able to fully overcome.

Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

Comments: 27 pages

arXiv:2507.00480 [pdf, ps, other]

Posterior Inference in Latent Space for Scalable Constrained Black-box Optimization

Authors: Kiyoung Om, Kyuil Sim, Taeyoung Yun, Hyeongyu Kang, Jinkyoo Park

Abstract: Optimizing high-dimensional black-box functions under black-box constraints is a pervasive task in a wide range of scientific and engineering problems. These problems are typically harder than unconstrained problems due to hard-to-find feasible regions. While Bayesian optimization (BO) methods have been developed to solve such problems, they often struggle with the curse of dimensionality. Recently, generative model-based approaches have emerged as a promising alternative for constrained optimization. However, they suffer from poor scalability and are vulnerable to mode collapse, particularly when the target distribution is highly multi-modal. In this paper, we propose a new framework to overcome these challenges. Our method iterates through two stages. First, we train flow-based models to capture the data distribution and surrogate models that predict both function values and constraint violations with uncertainty quantification. Second, we cast the candidate selection problem as a posterior inference problem to effectively search for promising candidates that have high objective values while not violating the constraints. During posterior inference, we find that the posterior distribution is highly multi-modal and has a large plateau due to constraints, especially when constraint feedback is given as binary indicators of feasibility. To mitigate this issue, we amortize the sampling from the posterior distribution in the latent space of flow-based models, which is much smoother than that in the data space. We empirically demonstrate that our method achieves superior performance on various synthetic and real-world constrained black-box optimization tasks. Our code is publicly available at https://github.com/umkiyoung/CiBO.

Submitted 1 July, 2025; originally announced July 2025.

Comments: 25 pages, 11 figures, 5 tables. Equal contribution by Kiyoung Om, Kyuil Sim, and Taeyoung Yun

arXiv:2507.00198 [pdf, ps, other]

Exploring AR Label Placements in Visually Cluttered Scenarios

Authors: Ji Hwan Park, Braden Roper, Amirhossein Arezoumand, Tien Tran

Abstract: We investigate methods for placing labels in AR environments that have visually cluttered scenes. As the number of items increases in a scene within the user's FOV, it is challenging to effectively place labels based on existing label placement guidelines. To address this issue, we implemented three label placement techniques for in-view objects for AR applications. We specifically target a scenario where various items of different types are scattered within the user's field of view, and multiple items of the same type are situated close together. We evaluate three placement techniques for three target tasks. Our study shows that using a label to spatially group the same types of items is beneficial for identifying, comparing, and summarizing data.

Submitted 30 June, 2025; originally announced July 2025.

arXiv:2506.23552 [pdf, ps, other]

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Authors: Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

Abstract: The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs (including text, reference audio, and reference motion), facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web

Submitted 30 June, 2025; originally announced June 2025.

Comments: project page: https://joonghyuk.com/jamflow-web Under review. Preprint published on arXiv

arXiv:2506.23530 [pdf, ps, other]

Investigation of resonant layer response in electron viscosity regime

Authors: Yeongsun Lee, Jace Waybright, Jong-Kyu Park

Abstract: We present a supplementary study of previous work in Waybright and Park [Phys. Plasmas 31, 022502 (2024)] which demonstrates a substantial effect of electron viscosity on the resonant layer response to non-axisymmetric magnetic perturbations. A main refinement is to include a curl element of electron viscosity in the generalized Ohm's law. The refinement reveals a resonant layer response in the Electron Viscosity (EV) regime corresponding to slowly rotating and highly viscous plasmas.

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23529 [pdf, ps, other]

When Test-Time Adaptation Meets Self-Supervised Models

Authors: Jisu Han, Jihee Park, Dongyoon Han, Wonjun Hwang

Abstract: Training on test-time data enables deep learning models to adapt to dynamic environmental changes, enhancing their practical applicability. Online adaptation from source to target domains is promising, but it remains highly reliant on the performance of the source pretrained model. In this paper, we investigate whether test-time adaptation (TTA) methods can continuously improve models trained via self-supervised learning (SSL) without relying on source pretraining. We introduce a self-supervised TTA protocol after observing that existing TTA approaches struggle when directly applied to self-supervised models with low accuracy on the source domain. Furthermore, we propose a collaborative learning framework that integrates SSL and TTA models, leveraging contrastive learning and knowledge distillation for stepwise representation refinement. We validate our method on diverse self-supervised models, including DINO, MoCo, and iBOT, across TTA benchmarks. Extensive experiments validate the effectiveness of our approach in SSL, showing that it achieves competitive performance even without source pretraining.

Submitted 30 June, 2025; originally announced June 2025.

Comments: 15 pages, 7 figures

arXiv:2506.23518 [pdf, ps, other]

WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

Authors: Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang

Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.22694 [pdf, ps, other]

VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Authors: Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee

Abstract: In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, a target model, accepting a subset as its valid generation. As it is usually considered that speculative decoding requires a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead and improve the generation speed in memory-bound environments. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected as those most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in memory-bound processes, which is often the case on edge devices, resulting in higher memory-bound speed-up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.

Submitted 3 July, 2025; v1 submitted 27 June, 2025; originally announced June 2025.

Comments: 8 pages, 4 figures, 5 tables, accepted at ICML 2025 workshop on Efficient Systems for Foundational Models

arXiv:2506.21944 [pdf, ps, other]

Ranking dynamics in movies and music

Authors: Hyun-Woo Lee, Gerardo Iñiguez, Hang-Hyun Jo, Hye Jin Park

Abstract: Ranking systems are widely used to simplify and interpret complex data across diverse domains, from economic indicators and sports scores to online content popularity. While previous studies, including those on Zipf's law, have focused on the static, aggregated properties of ranks, in recent years researchers have begun to uncover generic features in their temporal dynamics. In this work, we introduce and study a series of system-level indices that quantify the compositional changes in ranking lists over time, and also characterize the temporal ranking trajectories of individual items. We apply our method to analyze ranking dynamics of movies from the over-the-top services, including Netflix, as well as that of music items in Spotify charts. We find that newly released movies or music items most strongly influence the system-level compositional changes of ranking lists; the highest ranks of items are more strongly correlated with their lifetimes in the lists than their first and last ranks are. Our findings offer a novel lens to understand collective ranking dynamics and provide a basis for comparing fluctuation patterns across various ordered systems.

Submitted 27 June, 2025; originally announced June 2025.

arXiv:2506.21896 [pdf, ps, other]

Focus on the Experts: Co-designing an Augmented Reality Eye-Gaze Tracking System with Surgical Trainees to Improve Endoscopic Instruction

Authors: Jumanh Atoum, Jinkyung Park, Mamtaj Akter, Nicholas Kavoussi, Pamela Wisniewski, Jie Ying Wu

Abstract: The current apprenticeship model for surgical training requires a high level of supervision, which does not scale well to meet the growing need for more surgeons. Many endoscopic procedures are directly taught in the operating room (OR) while the attending surgeon and trainee operate on patients. The need to prioritize patient care limits the trainees' opportunities to experiment and receive feedback on their performance. Augmented reality (AR) has the potential to increase efficiency in endoscopic surgical training, but additional research is critical to understanding the needs of surgical trainees to inform the design of AR training systems. Therefore, we worked with 18 surgical trainees to understand the strengths, limitations, and unmet needs of their current training environment and to co-design an AR eye-gaze tracking system based on their preferences. Trainees emphasized the need to practice the 2D to 3D mapping needed to properly familiarize oneself with the anatomy of patients to prepare for real surgery. The trainees felt that an AR-based eye gaze tracking system would be a useful supplemental training method that would improve their learning in OR cases without detracting from patient care. To tailor the AR system to their needs, they co-designed features to improve their ability to track the attending surgeon's eye gaze and to provide a real-time, interactive system. Our results are valuable in shaping the endoscopic training modules by generating user-informed guidelines to design future collaborative AR-based eye-gaze tracking systems.

Submitted 27 June, 2025; originally announced June 2025.

arXiv:2506.21595 [pdf, ps, other]

Thunder-LLM: Efficiently Adapting LLMs to Korean with Minimal Resources

Authors: Jinpyo Kim, Gyeongje Cho, Chanwoo Park, Jongwon Park, Jongmin Kim, Yeonkyoun So, Jaejin Lee

Abstract: Since state-of-the-art LLMs often underperform in languages other than English or Chinese, improving the capability of LLMs in new languages has become an essential task. Moreover, LLMs' entire end-to-end training process remains largely unknown to the public due to proprietary reasons, technical complexity, inconsistent documentation, and ethical considerations. The complete picture remains a closely guarded secret within the industry. This paper presents methods to adapt an existing English-based LLM to Korean in a low-budget scenario. We describe the entire end-to-end process: collecting Korean datasets, preprocessing the data, training the model, creating downstream benchmarks, and conducting evaluations. The evaluation results indicate that our method can effectively and cost-efficiently add new language capabilities to existing LLMs. Our new bilingual models, Thunder-LLM and Thunder-LLM-Ins, achieve superior Korean performance compared to state-of-the-art models while utilizing minimal data and computational resources. We share our comprehensive experience and make the code publicly available.

Submitted 18 June, 2025; originally announced June 2025.

Comments: Submitted to ARR 2025 May cycle

arXiv:2506.21556 [pdf, ps, other]

VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Authors: Hyeongcheol Park, MinHyuk Jang, Ha Dam Baek, Gyusam Chang, Jiyoung Seo, Jiwan Park, Hogun Park, Sangpil Kim

Abstract: Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.

Submitted 11 June, 2025; originally announced June 2025.

Comments: Project Page: https://vatkg.github.io/

arXiv:2506.21174 [pdf]

Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4

Authors: Jongyeon Park, Joonhee Lee, Do-Hyeon Lim, Hong Kook Kim, Hyeongcheol Geum, Jeong Eun Lim

Abstract: This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. First, the model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to improve the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alternative perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classification accuracy of low-performing classes by removing irrelevant samples and incorporating external data. That is, audio mixtures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submitted systems employing these approaches achieve a relative CA-SDRi improvement of up to 14.7% compared to the baseline of DCASE 2025 Challenge Task 4.

Submitted 26 June, 2025; originally announced June 2025.

Comments: DCASE 2025 challenge Task4, 5 pages

arXiv:2506.21143 [pdf, ps, other]

$\mathbf{O}(D,D)$-Symmetric Box Operator and $α^{\prime}$-Corrections with Riemann Curvature

Authors: Kawon Lee, Jeong-Hyuck Park

Abstract: Within the framework of Double Field Theory, we construct an $\mathbf{O}(D,D)$-symmetric d'Alembertian, or box operator, that is applicable to tensors of arbitrary rank. Parameterized by the Riemannian metric and the $B$-field, the operator naturally incorporates the Riemann curvature tensor and the $H$-flux. When applied to the massless string sector, it produces a consistent stringy wave equation under an $\mathbf{O}(D,D)$-symmetric harmonic gauge condition. Furthermore, the one-loop integral of the massive string modes, whose kinetic terms are governed by the box operator, yields Riemann curvature $α^{\prime}$-corrections. Yet, the momentum integral generically breaks the $\mathbf{O}(D,D)$ symmetry.

Submitted 26 June, 2025; originally announced June 2025.

Comments: 7+8 pages

arXiv:2506.21021 [pdf, ps, other]

Identification of Noise-Associated Glitches in KAGRA O3GK with Hveto

Authors: T. Akutsu, M. Ando, M. Aoumi, A. Araya, Y. Aso, L. Baiotti, R. Bajpai, K. Cannon, A. H. -Y. Chen, D. Chen, H. Chen, A. Chiba, C. Chou, M. Eisenmann, K. Endo, T. Fujimori, S. Garg, D. Haba, S. Haino, R. Harada, H. Hayakawa, K. Hayama, S. Fujii, Y. Himemoto, N. Hirata, et al. (127 additional authors not shown)

Abstract: Transient noise ("glitches") in gravitational wave detectors can mimic or obscure true signals, significantly reducing detection sensitivity. Identifying and excluding glitch-contaminated data segments is therefore crucial for enhancing the performance of gravitational-wave searches. We perform a noise analysis of the KAGRA data obtained during the O3GK observation. Our analysis is performed with hierarchical veto (Hveto) which identifies noises based on the statistical time correlation between the main channel and the auxiliary channels. A total of 2,531 noises were vetoed by 28 auxiliary channels with the configuration (i.e., signal-to-noise threshold set to 8) that we chose for Hveto. We identify vetoed events as glitches on the spectrogram via visual examination after plotting them with Q-transformation. By referring to the Gravity Spy project, we categorize 2,354 glitches into six types: blip, helix, scratchy, and scattered light, which correspond to those listed in Gravity Spy, and dot and line, which are not found in the Gravity Spy classification and are thus named based on their spectrogram morphology in KAGRA data. The remaining 177 glitches are determined not to belong to any of these six types. We show how the KAGRA glitch types are related to each subsystem of KAGRA. To investigate the possible correlation between the main channel and the round winner - an auxiliary channel statistically associated with the main channel for vetoing purposes - we visually examine the similarity or difference in the glitch pattern on the spectrogram. We compare the qualitative correlation found through visual examination with coherence, which is known to provide quantitative measurement for the correlation between the main channel and each auxiliary channel. Our comprehensive noise analysis will help improve the data quality of KAGRA by applying it to future KAGRA observation data.

Submitted 26 June, 2025; originally announced June 2025.

Comments: To appear in Progress of Theoretical and Experimental Physics (PTEP), accepted June 2025

arXiv:2506.19697 [pdf, ps, other]

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

Authors: Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang

Abstract: Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19451 [pdf, ps, other]

Low-Complexity Semantic Packet Aggregation for Token Communication via Lookahead Search

Authors: Seunghun Lee, Jihong Park, Jinho Choi, Hyuncheol Park

Abstract: Tokens are fundamental processing units of generative AI (GenAI) and large language models (LLMs), and token communication (TC) is essential for enabling remote AI-generated content (AIGC) and wireless LLM applications. Unlike traditional bits, each of which is independently treated, the semantics of each token depends on its surrounding context tokens. This inter-token dependency makes TC vulnerable to outage channels, where the loss of a single token can significantly distort the original message semantics. Motivated by this, this paper focuses on optimizing token packetization to maximize the average token similarity (ATS) between the original and received token messages under outage channels. Due to inter-token dependency, this token grouping problem is combinatorial, with complexity growing exponentially with message length. To address this, we propose a novel framework of semantic packet aggregation with lookahead search (SemPA-Look), built on two core ideas. First, it introduces the residual semantic score (RSS) as a token-level surrogate for the message-level ATS, allowing robust semantic preservation even when a certain token packet is lost. Second, instead of full search, SemPA-Look applies a lookahead search-inspired algorithm that samples intra-packet token candidates without replacement (fixed depth), conditioned on inter-packet token candidates sampled with replacement (fixed width), thereby achieving linear complexity. Experiments on a remote AIGC task with the MS-COCO dataset (text captioned images) demonstrate that SemPA-Look achieves high ATS and LPIPS scores comparable to exhaustive search, while reducing computational complexity by up to 40$\times$. Compared to other linear-complexity algorithms such as the genetic algorithm (GA), SemPA-Look achieves 10$\times$ lower complexity, demonstrating its practicality for remote AIGC and other TC applications.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19389

Emergence of Text Readability in Vision Language Models

Authors: Jaeyoo Park, Sanghyuk Chun, Wonjae Kim, Sangdoo Yun, Bohyung Han

Abstract: We investigate how the ability to recognize textual content within images emerges during the training of Vision-Language Models (VLMs). Our analysis reveals a critical phenomenon: the ability to read textual information in a given image (text readability) emerges abruptly after substantial training iterations, in contrast to semantic content understanding, which develops gradually from the early stages of training. This delayed emergence may reflect how contrastive learning tends to initially prioritize general semantic understanding, with text-specific symbolic processing developing later. Interestingly, the ability to match images with rendered text develops even more slowly, indicating a deeper need for semantic integration. These findings highlight the need for tailored training strategies to accelerate robust text comprehension in VLMs, laying the groundwork for future research on optimizing multimodal learning.

Submitted 24 June, 2025; originally announced June 2025.

Comments: EVAL-FoMo Workshop @ CVPR 2025

arXiv:2506.19144

Posterior Contraction for Sparse Neural Networks in Besov Spaces with Intrinsic Dimensionality

Authors: Kyeongwon Lee, Lizhen Lin, Jaewoo Park, Seonghyun Jeong

Abstract: This work establishes that sparse Bayesian neural networks achieve optimal posterior contraction rates over anisotropic Besov spaces and their hierarchical compositions. These structures reflect the intrinsic dimensionality of the underlying function, thereby mitigating the curse of dimensionality. Our analysis shows that Bayesian neural networks equipped with either sparse or continuous shrinkage priors attain the optimal rates which are dependent on the intrinsic dimension of the true structures. Moreover, we show that these priors enable rate adaptation, allowing the posterior to contract at the optimal rate even when the smoothness level of the true function is unknown. The proposed framework accommodates a broad class of functions, including additive and multiplicative Besov functions as special cases. These results advance the theoretical foundations of Bayesian neural networks and provide rigorous justification for their practical effectiveness in high-dimensional, structured estimation problems.

Submitted 23 June, 2025; originally announced June 2025.

Showing 1–50 of 4,668 results for author: Park, J