Showing 1–50 of 1,745 results for author: Park, J

  1. arXiv:2507.06233  [pdf, ps, other]

    cs.CV

    Learning to Track Any Points from Human Motion

    Authors: Inès Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Jiahui Huang, Joon-Young Lee, Seungryong Kim

    Abstract: Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracki…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: Project Page: https://cvlab-kaist.github.io/AnthroTAP/

  2. arXiv:2507.06133  [pdf, ps, other]

    cs.CE

    Bridging Sequential Deep Operator Network and Video Diffusion: Residual Refinement of Spatio-Temporal PDE Solutions

    Authors: Jaewan Park, Farid Ahmed, Kazuma Kobayashi, Seid Koric, Syed Bahauddin Alam, Iwona Jasiuk, Diab Abueidda

    Abstract: Video-diffusion models have recently set the standard in video generation, inpainting, and domain translation thanks to their training stability and high perceptual fidelity. Building on these strengths, we repurpose conditional video diffusion as a physics surrogate for spatio-temporal fields governed by partial differential equations (PDEs). Our two-stage surrogate first applies a Sequential Dee…

    Submitted 8 July, 2025; originally announced July 2025.

  3. arXiv:2507.05822  [pdf, ps, other]

    cs.CV

    Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models

    Authors: Léa Dubois, Klaus Schmidt, Chengyu Wang, Ji-Hoon Park, Lin Wang, Santiago Munoz

    Abstract: Current video understanding models excel at recognizing "what" is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large L…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: 22 pages, 4 figures

    MSC Class: CS

    ACM Class: I.2.10

  4. arXiv:2507.05673  [pdf, ps, other]

    cs.CV

    R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding

    Authors: Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar

    Abstract: Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: ACL 2025; 17 pages

  5. arXiv:2507.04482  [pdf, ps, other]

    cs.CV

    A Training-Free Style-Personalization via Scale-wise Autoregressive Model

    Authors: Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im

    Abstract: We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design--content, style, and generation--each guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central c…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: 13 pages, 10 figures

  6. Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning

    Authors: Nayeon Kim, Eojin Jeon, Jun-Hyung Park, SangKeun Lee

    Abstract: In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, facilitating Korean OOV word representations to capture both text and phoneme i…

    Submitted 5 July, 2025; originally announced July 2025.

    Journal ref: Advances in Knowledge Discovery and Data Mining. PAKDD 2025

  7. arXiv:2507.03660  [pdf, ps, other]

    cs.LG

    When Network Architecture Meets Physics: Deep Operator Learning for Coupled Multiphysics

    Authors: Kazuma Kobayashi, Jaewan Park, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam

    Abstract: Scientific applications increasingly demand real-time surrogate models that can capture the behavior of strongly coupled multiphysics systems driven by multiple input functions, such as in thermo-mechanical and electro-thermal processes. While neural operator frameworks, such as Deep Operator Networks (DeepONets), have shown considerable success in single-physics settings, their extension to multi…

    Submitted 4 July, 2025; originally announced July 2025.

  8. arXiv:2507.03114  [pdf, ps, other]

    cs.DC

    Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications

    Authors: Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan

    Abstract: This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of models, distributing them across multiple devices is required. Overlapping strategies, which enable concurrent computation and communication, are critical for…

    Submitted 3 July, 2025; originally announced July 2025.

  9. arXiv:2507.01496  [pdf, ps, other]

    cs.CV

    ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation

    Authors: Jimyeong Kim, Jungwon Park, Yeji Song, Nojun Kwak, Wonjong Rhee

    Abstract: Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Published at ICCV 2025. Project page: https://wlaud1001.github.io/ReFlex/

  10. arXiv:2507.00726  [pdf, ps, other]

    cs.AI cs.LG

    Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess

    Authors: Dongyoon Hwang, Hojoon Lee, Jaegul Choo, Dongmin Park, Jongho Park

    Abstract: While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide dense reward on the LLM's output move quality, which…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

    Comments: 27 pages

  11. arXiv:2507.00480  [pdf, ps, other]

    cs.LG stat.ML

    Posterior Inference in Latent Space for Scalable Constrained Black-box Optimization

    Authors: Kiyoung Om, Kyuil Sim, Taeyoung Yun, Hyeongyu Kang, Jinkyoo Park

    Abstract: Optimizing high-dimensional black-box functions under black-box constraints is a pervasive task in a wide range of scientific and engineering problems. These problems are typically harder than unconstrained problems due to hard-to-find feasible regions. While Bayesian optimization (BO) methods have been developed to solve such problems, they often struggle with the curse of dimensionality. Recentl…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 25 pages, 11 figures, 5 tables. Equal contribution by Kiyoung Om, Kyuil Sim, and Taeyoung Yun

  12. arXiv:2507.00198  [pdf, ps, other]

    cs.HC

    Exploring AR Label Placements in Visually Cluttered Scenarios

    Authors: Ji Hwan Park, Braden Roper, Amirhossein Arezoumand, Tien Tran

    Abstract: We investigate methods for placing labels in AR environments that have visually cluttered scenes. As the number of items increases in a scene within the user's FOV, it is challenging to effectively place labels based on existing label placement guidelines. To address this issue, we implemented three label placement techniques for in-view objects for AR applications. We specifically target a scenari…

    Submitted 30 June, 2025; originally announced July 2025.

  13. arXiv:2506.23552  [pdf, ps, other]

    cs.CV cs.SD eess.AS

    JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

    Authors: Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

    Abstract: The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transfo…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Project page: https://joonghyuk.com/jamflow-web (under review; preprint published on arXiv)

  14. arXiv:2506.23529  [pdf, ps, other]

    cs.CV cs.LG

    When Test-Time Adaptation Meets Self-Supervised Models

    Authors: Jisu Han, Jihee Park, Dongyoon Han, Wonjun Hwang

    Abstract: Training on test-time data enables deep learning models to adapt to dynamic environmental changes, enhancing their practical applicability. Online adaptation from source to target domains is promising, but it remains highly reliant on the performance of the source pretrained model. In this paper, we investigate whether test-time adaptation (TTA) methods can continuously improve models trained via self-…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 15 pages, 7 figures

  15. arXiv:2506.23518  [pdf, ps, other]

    cs.CV

    WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

    Authors: Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang

    Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lac…

    Submitted 30 June, 2025; originally announced June 2025.

  16. arXiv:2506.22694  [pdf, ps, other]

    cs.CL

    VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

    Authors: Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee

    Abstract: In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, f…

    Submitted 3 July, 2025; v1 submitted 27 June, 2025; originally announced June 2025.

    Comments: 8 pages, 4 figures, 5 tables, accepted at ICML 2025 workshop on Efficient Systems for Foundational Models

  17. arXiv:2506.21896  [pdf, ps, other]

    cs.HC

    Focus on the Experts: Co-designing an Augmented Reality Eye-Gaze Tracking System with Surgical Trainees to Improve Endoscopic Instruction

    Authors: Jumanh Atoum, Jinkyung Park, Mamtaj Akter, Nicholas Kavoussi, Pamela Wisniewski, Jie Ying Wu

    Abstract: The current apprenticeship model for surgical training requires a high level of supervision, which does not scale well to meet the growing need for more surgeons. Many endoscopic procedures are directly taught in the operating room (OR) while the attending surgeon and trainee operate on patients. The need to prioritize patient care limits the trainees' opportunities to experiment and receive feedb…

    Submitted 27 June, 2025; originally announced June 2025.

  18. arXiv:2506.21595  [pdf, ps, other]

    cs.CL

    Thunder-LLM: Efficiently Adapting LLMs to Korean with Minimal Resources

    Authors: Jinpyo Kim, Gyeongje Cho, Chanwoo Park, Jongwon Park, Jongmin Kim, Yeonkyoun So, Jaejin Lee

    Abstract: Since state-of-the-art LLMs often underperform in languages other than English or Chinese, improving the capability of LLMs in new languages has become an essential task. Moreover, LLMs' entire end-to-end training process remains largely unknown to the public due to proprietary reasons, technical complexity, inconsistent documentation, and ethical considerations. The complete picture remains a clo…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: Submitted to ARR 2025 May cycle

  19. arXiv:2506.21556  [pdf, ps, other]

    cs.CL

    VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

    Authors: Hyeongcheol Park, MinHyuk Jang, Ha Dam Baek, Gyusam Chang, Jiyoung Seo, Jiwan Park, Hogun Park, Sangpil Kim

    Abstract: Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowled…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Project Page: https://vatkg.github.io/

  20. arXiv:2506.21174  [pdf]

    eess.AS cs.LG

    Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4

    Authors: Jongyeon Park, Joonhee Lee, Do-Hyeon Lim, Hong Kook Kim, Hyeongcheol Geum, Jeong Eun Lim

    Abstract: This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to improve the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: DCASE 2025 challenge Task4, 5 pages

  21. arXiv:2506.19697  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

    Authors: Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang

    Abstract: Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than rely…

    Submitted 24 June, 2025; originally announced June 2025.

  22. arXiv:2506.19451  [pdf, ps, other]

    eess.SP cs.LG

    Low-Complexity Semantic Packet Aggregation for Token Communication via Lookahead Search

    Authors: Seunghun Lee, Jihong Park, Jinho Choi, Hyuncheol Park

    Abstract: Tokens are fundamental processing units of generative AI (GenAI) and large language models (LLMs), and token communication (TC) is essential for enabling remote AI-generated content (AIGC) and wireless LLM applications. Unlike traditional bits, each of which is independently treated, the semantics of each token depends on its surrounding context tokens. This inter-token dependency makes TC vulnerab…

    Submitted 24 June, 2025; originally announced June 2025.

  23. arXiv:2506.19389  [pdf, ps, other]

    cs.CV

    Emergence of Text Readability in Vision Language Models

    Authors: Jaeyoo Park, Sanghyuk Chun, Wonjae Kim, Sangdoo Yun, Bohyung Han

    Abstract: We investigate how the ability to recognize textual content within images emerges during the training of Vision-Language Models (VLMs). Our analysis reveals a critical phenomenon: the ability to read textual information in a given image (text readability) emerges abruptly after substantial training iterations, in contrast to semantic content understanding which develops gradually from the…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: EVAL-FoMo Workshop @ CVPR 2025

  24. arXiv:2506.19144  [pdf, ps, other]

    stat.ML cs.LG

    Posterior Contraction for Sparse Neural Networks in Besov Spaces with Intrinsic Dimensionality

    Authors: Kyeongwon Lee, Lizhen Lin, Jaewoo Park, Seonghyun Jeong

    Abstract: This work establishes that sparse Bayesian neural networks achieve optimal posterior contraction rates over anisotropic Besov spaces and their hierarchical compositions. These structures reflect the intrinsic dimensionality of the underlying function, thereby mitigating the curse of dimensionality. Our analysis shows that Bayesian neural networks equipped with either sparse or continuous shrinkage…

    Submitted 23 June, 2025; originally announced June 2025.

  25. arXiv:2506.18497  [pdf, ps, other]

    cond-mat.mtrl-sci cs.LG

    Leveraging neural network interatomic potentials for a foundation model of chemistry

    Authors: So Yeon Kim, Yang Jeong Park, Ju Li

    Abstract: Large-scale foundation models, including neural network interatomic potentials (NIPs) in computational materials science, have demonstrated significant potential. However, despite their success in accelerating atomistic simulations, NIPs face challenges in directly predicting electronic properties and often require coupling to higher-scale models or extensive simulations for macroscopic properties…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 29 pages, 10 figures

  26. arXiv:2506.17896  [pdf, ps, other]

    cs.CV cs.AI

    EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations

    Authors: Junho Park, Andrew Sangwoo Ye, Taein Kwon

    Abstract: Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR) and robotics applications. However, current exocentric-to-egocentric translation methods are limited b…

    Submitted 22 June, 2025; originally announced June 2025.

    Comments: Project Page: https://redorangeyellowy.github.io/EgoWorld/

  27. arXiv:2506.17707  [pdf, ps, other]

    cs.CV cs.AI cs.MM

    Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models

    Authors: Jihyun Kim, Junho Park, Kyeongbo Kong, Suk-Ju Kang

    Abstract: We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each attribute of a room, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes, generating panorama images for the texture, constructing 3D meshes by integrating the coordinates and pan…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: Accepted by IEEE Transactions on Multimedia

  28. arXiv:2506.16754  [pdf, ps, other]

    cs.LG cs.AI cs.SI

    Metapath-based Hyperbolic Contrastive Learning for Heterogeneous Graph Embedding

    Authors: Jongmin Park, Seunghoon Han, Won-Yong Shin, Sungsu Lim

    Abstract: The hyperbolic space, characterized by a constant negative curvature and exponentially expanding space, aligns well with the structural properties of heterogeneous graphs. However, although heterogeneous graphs inherently possess diverse power-law structures, most hyperbolic heterogeneous graph embedding models rely on a single hyperbolic space. This approach may fail to effectively capture the di…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: 14 pages, 9 figures

  29. arXiv:2506.16741  [pdf, ps, other]

    eess.AS cs.AI

    RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

    Authors: Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song

    Abstract: We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, Ra…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Accepted at Interspeech 2025

  30. arXiv:2506.16444  [pdf, ps, other]

    cs.CL cs.AR cs.DB

    REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing

    Authors: Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, Onur Mutlu

    Abstract: Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significa…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Extended version of our publication at the 52nd International Symposium on Computer Architecture (ISCA-52), 2025

    ACM Class: H.3.3; I.2.7

  31. arXiv:2506.15831  [pdf, ps, other]

    cs.DB

    Adaptive Anomaly Detection in the Presence of Concept Drift: Extended Report

    Authors: Jongjun Park, Fei Chiang, Mostafa Milani

    Abstract: The presence of concept drift poses challenges for anomaly detection in time series. While anomalies are caused by undesirable changes in the data, differentiating abnormal changes from varying normal behaviours is difficult due to differing frequencies of occurrence, varying time intervals when normal patterns occur, and identifying similarity thresholds to separate the boundary between normal vs…

    Submitted 28 June, 2025; v1 submitted 18 June, 2025; originally announced June 2025.

    Comments: Extended version (to be updated)

  32. arXiv:2506.14657  [pdf, ps, other]

    eess.AS cs.AR

    ASAP-FE: Energy-Efficient Feature Extraction Enabling Multi-Channel Keyword Spotting on Edge Processors

    Authors: Jongin Choi, Jina Park, Woojoo Lee, Jae-Jin Lee, Massoud Pedram

    Abstract: Multi-channel keyword spotting (KWS) has become crucial for voice-based applications in edge environments. However, its substantial computational and energy requirements pose significant challenges. We introduce ASAP-FE (Agile Sparsity-Aware Parallelized-Feature Extractor), a hardware-oriented front-end designed to address these challenges. Our framework incorporates three key innovations: (1) Hal…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: 7 pages, 11 figures, ISLPED 2025

  33. arXiv:2506.14107  [pdf, ps, other]

    cs.DC cs.CV

    Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse

    Authors: Jinwoo Hwang, Daeun Kim, Sangyeop Lee, Yoonsung Kim, Guseul Heo, Hojoon Kim, Yunseok Jeong, Tadiwos Meaza, Eunhyeok Park, Jeongseob Ahn, Jongse Park

    Abstract: Recently, Video-Language Models (VideoLMs) have demonstrated remarkable capabilities, offering significant potential for flexible and powerful video query systems. These models typically rely on Vision Transformers (ViTs), which process video frames individually to extract visual embeddings. However, generating embeddings for large-scale videos requires ViT inferencing across numerous frames, posi…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted to 2025 VLDB

  34. arXiv:2506.13754  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    VideoPDE: Unified Generative PDE Solving via Video Inpainting Diffusion Models

    Authors: Edward Li, Zichen Wang, Jiahe Huang, Jeong Joon Park

    Abstract: We present a unified framework for solving partial differential equations (PDEs) using video-inpainting diffusion transformer models. Unlike existing methods that devise specialized strategies for either forward or inverse problems under full or partial observation, our approach unifies these tasks under a single, flexible generative framework. Specifically, we recast PDE-solving as a generalized…

    Submitted 16 June, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

    Comments: Project page: https://videopde.github.io/

  35. arXiv:2506.13298  [pdf, ps, other]

    cs.CV cs.AI

    Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention

    Authors: Jeonghoon Park, Juyoung Lee, Chaeyeon Chung, Jaeseong Lee, Jaegul Choo, Jindong Gu

    Abstract: Recent advancements in diffusion-based text-to-image (T2I) models have enabled the generation of high-quality and photorealistic images from text descriptions. However, they often exhibit societal biases related to gender, race, and socioeconomic status, thereby reinforcing harmful stereotypes and shaping public perception in unintended ways. While existing bias mitigation methods demonstrate effe…

    Submitted 16 June, 2025; originally announced June 2025.

  36. arXiv:2506.12945  [pdf, ps, other]

    cs.CV

    Metropolis-Hastings Sampling for 3D Gaussian Reconstruction

    Authors: Hyunjin Kim, Haebeom Jung, Jaesik Park

    Abstract: We propose an adaptive sampling framework for 3D Gaussian Splatting (3DGS) that leverages comprehensive multi-view photometric error signals within a unified Metropolis-Hastings approach. Traditional 3DGS methods heavily rely on heuristic-based density-control mechanisms (e.g., cloning, splitting, and pruning), which can lead to redundant computations or the premature removal of beneficial Gaussia…

    Submitted 15 June, 2025; originally announced June 2025.

    Comments: Project Page: https://hjhyunjinkim.github.io/MH-3DGS

  37. arXiv:2506.12413  [pdf, ps, other]

    cs.CV

    Domain Generalization for Person Re-identification: A Survey Towards Domain-Agnostic Person Matching

    Authors: Hyeonseo Lee, Juhyun Park, Jihyong Oh, Chanho Eom

    Abstract: Person Re-identification (ReID) aims to retrieve images of the same individual captured across non-overlapping camera views, making it a critical component of intelligent surveillance systems. Traditional ReID methods assume that the training and test domains share similar characteristics and primarily focus on learning discriminative features within a given domain. However, they often fail to gen…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: Please visit our project page at https://github.com/PerceptualAI-Lab/Awesome-Domain-Generalizable-Person-Re-ID

  38. arXiv:2506.11474  [pdf, ps, other]

    cs.CL

    Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards

    Authors: Jaehoon Yun, Jiwoong Sohn, Jungwoo Park, Hyunjae Kim, Xiangru Tang, Yanjun Shao, Yonghoe Koo, Minhyeok Ko, Qingyu Chen, Mark Gerstein, Michael Moor, Jaewoo Kang

    Abstract: Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that lever…

    Submitted 13 June, 2025; originally announced June 2025.

  39. arXiv:2506.11115  [pdf, other]

    cs.CL cs.AI

    Incorporating Domain Knowledge into Materials Tokenization

    Authors: Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee

    Abstract: While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel…

    Submitted 9 June, 2025; originally announced June 2025.

  40. arXiv:2506.09993  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Text-Aware Image Restoration with Diffusion Models

    Authors: Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim

    Abstract: Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-…

    Submitted 3 July, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: Project page: https://cvlab-kaist.github.io/TAIR/

  41. arXiv:2506.09883  [pdf, ps, other]

    cs.CV cs.AI

    3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

    Authors: Seonho Lee, Jiho Choi, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim

    Abstract: Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) s…

    Submitted 11 June, 2025; originally announced June 2025.

  42. arXiv:2506.08725  [pdf]

    cs.HC cs.LG

    Stop Misusing t-SNE and UMAP for Visual Analytics

    Authors: Hyeon Jeon, Jeongin Park, Sungbok Shin, Jinwook Seo

    Abstract: Misuses of t-SNE and UMAP in visual analytics have become increasingly common. For example, although t-SNE and UMAP projections often do not faithfully reflect true distances between clusters, practitioners frequently use them to investigate inter-cluster relationships. In this paper, we bring this issue to the surface and comprehensively investigate why such misuse occurs and how to prevent it. W…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 9 pages

  43. arXiv:2506.08059  [pdf, ps, other]

    q-bio.QM cs.AI cs.LG

    CaliciBoost: Performance-Driven Evaluation of Molecular Representations for Caco-2 Permeability Prediction

    Authors: Huong Van Le, Weibin Ren, Junhong Kim, Yukyung Yun, Young Bin Park, Young Jun Kim, Bok Kyung Han, Inho Choi, Jong IL Park, Hwi-Yeol Yun, Jae-Mun Choi

    Abstract: Caco-2 permeability serves as a critical in vitro indicator for predicting the oral absorption of drug candidates during early-stage drug discovery. To enhance the accuracy and efficiency of computational predictions, we systematically investigated the impact of eight molecular feature representation types including 2D/3D descriptors, structural fingerprints, and deep learning-based embeddings com…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: 49 pages, 11 figures

  44. arXiv:2506.07744  [pdf, ps, other]

    cs.LG cs.AI cs.RO

    Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning

    Authors: Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, Yusung Kim

    Abstract: Existing offline hierarchical reinforcement learning methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph…

    Submitted 7 July, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: ICML 2025

  45. arXiv:2506.07719  [pdf, ps, other]

    cs.CL

    Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility

    Authors: Mengyang Qiu, Tran Minh Nguyen, Zihao Huang, Zelong Li, Yang Gu, Qingyu Gao, Siliang Liu, Jungyeul Park

    Abstract: Grammatical Error Correction (GEC) relies on accurate error annotation and evaluation, yet existing frameworks, such as $\texttt{errant}$, face limitations when extended to typologically diverse languages. In this paper, we introduce a standardized, modular framework for multilingual grammatical error annotation. Our approach combines a language-agnostic foundation with structured language-specifi…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: BEA2025

  46. arXiv:2506.07643  [pdf, ps, other]

    cs.CV

    Synthetic Visual Genome

    Authors: Jae Sung Park, Zixian Ma, Linjie Li, Chenhao Zheng, Cheng-Yu Hsieh, Ximing Lu, Khyathi Chandu, Quan Kong, Norimasa Kobori, Ali Farhadi, Yejin Choi, Ranjay Krishna

    Abstract: Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generations remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relations…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: CVPR 2025

  47. arXiv:2506.07464  [pdf, ps, other]

    cs.CV cs.AI

    DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

    Authors: Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim

    Abstract: Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training in enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success by employing a PPO-style reinforcement algorithm with group-based normalized rewards. However, the application of GRPO to Video Large Languag…

    Submitted 12 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Work in progress

  48. arXiv:2506.07416  [pdf, ps, other]

    cs.LG cs.AI

    LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments

    Authors: Jin Huang, Yuchao Jin, Le An, Josh Park

    Abstract: This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and specu…

    Submitted 9 June, 2025; originally announced June 2025.

  49. arXiv:2506.06803  [pdf, ps, other]

    cs.CY

    Spatial Disparities in Fire Shelter Accessibility: Capacity Challenges in the Palisades and Eaton Fires

    Authors: Su Yeon Han, Yubin Lee, Jooyoung Yoo, Jeon-Young Kang, Jinwoo Park, Soe W. Myint, Eunsang Cho, Xin Gu, Joon-Seok Kim

    Abstract: The increasing frequency and severity of wildfire in California, exacerbated by prolonged drought and environmental changes, pose significant challenges to urban community resilience and equitable emergency response. The study investigates issues of accessibility to shelters during the Palisades and Eaton Fires which started in January 2025 in Southern California that led to over 180,000 displacem…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: 35 pages, 11 figures

  50. arXiv:2506.05473  [pdf, ps, other]

    cs.CV

    S2GO: Streaming Sparse Gaussian Occupancy Prediction

    Authors: Jinhyung Park, Yihan Hu, Chensheng Peng, Wenzhao Zheng, Kris Kitani, Wei Zhan

    Abstract: Despite the demonstrated efficiency and performance of sparse query-based representations for perception, state-of-the-art 3D occupancy prediction methods still rely on voxel-based or dense Gaussian-based 3D representations. However, dense representations are slow, and they lack flexibility in capturing the temporal dynamics of driving scenes. Distinct from prior work, we instead summarize the sce…

    Submitted 5 June, 2025; originally announced June 2025.