
Showing 1–50 of 555 results for author: Hwang, J

Searching in archive cs.
  1. arXiv:2507.05418  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

    Authors: Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, Paul Pu Liang

    Abstract: Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermi…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2506.23518  [pdf, ps, other]

    cs.CV

    WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

    Authors: Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang

    Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lac…

    Submitted 30 June, 2025; originally announced June 2025.

  3. arXiv:2506.21039  [pdf, ps, other]

    cs.LG cs.AI

    Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

    Authors: Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han

    Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, they often suffer from subgoal infeasibility and inefficient planning. We introduce Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that enforces singl…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: 9 technical pages followed by references and appendix

  4. arXiv:2506.20066  [pdf, ps, other]

    cs.CV

    ToSA: Token Merging with Spatial Awareness

    Authors: Hsiang-Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang

    Abstract: Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token's feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens onl…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Accepted by IROS 2025

  5. arXiv:2506.15596  [pdf, ps, other]

    cs.CV

    Mono-Modalizing Extremely Heterogeneous Multi-Modal Medical Image Registration

    Authors: Kyobin Choo, Hyunkyung Han, Jinyeong Kim, Chanyong Yoon, Seong Jae Hwang

    Abstract: In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to stand…

    Submitted 30 June, 2025; v1 submitted 18 June, 2025; originally announced June 2025.

    Comments: 11 pages, 3 figures, 2 tables, Accepted at Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025

    ACM Class: I.4.5; I.4.9; J.3

  6. arXiv:2506.15380  [pdf, ps, other]

    cs.RO

    Efficient Navigation Among Movable Obstacles using a Mobile Manipulator via Hierarchical Policy Learning

    Authors: Taegeun Yang, Jiwoo Hwang, Jeil Jeong, Minsung Yoon, Sung-Eui Yoon

    Abstract: We propose a hierarchical reinforcement learning (HRL) framework for efficient Navigation Among Movable Obstacles (NAMO) using a mobile manipulator. Our approach combines interaction-based obstacle property estimation with structured pushing strategies, facilitating the dynamic manipulation of unforeseen obstacles while adhering to a pre-planned global path. The high-level policy generates pushing…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 8 pages, 6 figures, Accepted to IROS 2025. Supplementary Video: https://youtu.be/sZ8_z7sYVP0

  7. arXiv:2506.14107  [pdf, ps, other]

    cs.DC cs.CV

    Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse

    Authors: Jinwoo Hwang, Daeun Kim, Sangyeop Lee, Yoonsung Kim, Guseul Heo, Hojoon Kim, Yunseok Jeong, Tadiwos Meaza, Eunhyeok Park, Jeongseob Ahn, Jongse Park

    Abstract: Recently, Video-Language Models (VideoLMs) have demonstrated remarkable capabilities, offering significant potential for flexible and powerful video query systems. These models typically rely on Vision Transformers (ViTs), which process video frames individually to extract visual embeddings. However, generating embeddings for large-scale videos requires ViT inferencing across numerous frames, posi…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted to 2025 VLDB

  8. arXiv:2506.11877  [pdf, ps, other]

    cs.LG cs.AI

    Robust Molecular Property Prediction via Densifying Scarce Labeled Data

    Authors: Jina Kim, Jeffrey Willette, Bruno Andreis, Sung Ju Hwang

    Abstract: A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substan…

    Submitted 7 July, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

  9. arXiv:2506.10412  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

    Authors: Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang

    Abstract: Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM,…

    Submitted 23 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

    Comments: This paper is currently under review

  10. arXiv:2506.07466  [pdf, other]

    cs.IR

    Leveraging Historical and Current Interests for Continual Sequential Recommendation

    Authors: Gyuseok Lee, Hyunsik Yoo, Junyoung Hwang, SeongKu Kang, Hwanjo Yu

    Abstract: Sequential recommendation models based on the Transformer architecture show superior performance in harnessing long-range dependencies within user behavior via self-attention. However, naively updating them on continuously arriving non-stationary data streams incurs prohibitive computation costs or leads to catastrophic forgetting. To address this, we propose Continual Sequential Transformer for R…

    Submitted 9 June, 2025; originally announced June 2025.

  11. arXiv:2506.07177  [pdf, ps, other]

    cs.CV cs.AI

    Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

    Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang

    Abstract: Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation b…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: Project page: https://frame-guidance-video.github.io/

  12. arXiv:2506.05211  [pdf]

    cs.CY cs.AI

    Intentionally Unintentional: GenAI Exceptionalism and the First Amendment

    Authors: David Atkinson, Jena D. Hwang, Jacob Morrison

    Abstract: This paper challenges the assumption that courts should grant First Amendment protections to outputs from large generative AI models, such as GPT-4 and Gemini. We argue that because these models lack intentionality, their outputs do not constitute speech as understood in the context of established legal precedent, so there can be no speech to protect. Furthermore, if the model outputs are not spee…

    Submitted 5 June, 2025; originally announced June 2025.

  13. arXiv:2506.04704  [pdf, ps, other]

    cs.CV cs.AI

    HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

    Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang

    Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in…

    Submitted 11 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: Project page: https://youngwanlee.github.io/holisafe

  14. arXiv:2506.04288  [pdf, ps, other]

    cs.LG

    Backbone Augmented Training for Adaptations

    Authors: Jae Wan Park, Junhyeok Kim, Youngjun Jun, Hyunah Ko, Seong Jae Hwang

    Abstract: Adaptations facilitate efficient training of large backbone models, including diffusion models for image generation and transformer-based language models. While various adaptation techniques enhance performance with minimal computational resources, limited adaptation data often leads to challenges in training. To address this, we focus on the enormous amount of backbone data used to pre-train the…

    Submitted 4 June, 2025; originally announced June 2025.

  15. arXiv:2506.03610  [pdf, other]

    cs.AI

    Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

    Authors: Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho

    Abstract: Large Language Model (LLM) agents are reshaping the game industry, particularly with more intelligent and human-preferable game characters. However, existing game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets for aligning pre-trained LLMs into…

    Submitted 4 June, 2025; originally announced June 2025.

  16. arXiv:2506.00910  [pdf, other]

    cs.LG cs.AI

    PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models

    Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Dongseop Kim, Sung Ju Hwang

    Abstract: Knowledge distillation (KD) is a widely used framework for training compact, task-specific models by leveraging the knowledge of teacher models. However, its application to active learning (AL), which aims to minimize annotation costs through iterative sample selection, remains underexplored. This gap stems from the fact that KD typically assumes access to sufficient labeled data, whereas AL opera…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 35 pages, 30 figures

  17. arXiv:2506.00344  [pdf, ps, other]

    cs.CL cs.AI

    Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs

    Authors: Sungjae Lee, Hoyoung Kim, Jeongyeon Hwang, Eunhyeok Park, Jungseul Ok

    Abstract: Scaling test-time computation--generating and analyzing multiple or sequential outputs for a single input--has become a promising strategy for improving the reliability and quality of large language models (LLMs), as evidenced by advances in uncertainty quantification and multi-step reasoning. A key shared component is semantic clustering, which groups outputs that differ in form but convey the sa…

    Submitted 30 May, 2025; originally announced June 2025.

  18. arXiv:2506.00195  [pdf, ps, other]

    cs.CL cs.AI cs.HC

    Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences

    Authors: Mingqian Zheng, Wenjia Hu, Patrick Zhao, Motahhare Eslami, Jena D. Hwang, Faeze Brahman, Carolyn Rose, Maarten Sap

    Abstract: Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually had harmful intents, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy larg…

    Submitted 30 May, 2025; originally announced June 2025.

  19. arXiv:2505.24139  [pdf, ps, other]

    cs.CV cs.AI

    S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

    Authors: Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, Dragomir Anguelov, Mingxing Tan

    Abstract: The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches--which directly learn from sensor inputs to generate planning trajectories with…

    Submitted 3 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by CVPR2025; Project website: s4-driver.github.io

  20. arXiv:2505.23032  [pdf, ps, other]

    cs.LG cs.AI

    Bayesian Neural Scaling Law Extrapolation with Prior-Data Fitted Networks

    Authors: Dongwoo Lee, Dong Bok Lee, Steven Adriaensen, Juho Lee, Sung Ju Hwang, Frank Hutter, Seon Joo Kim, Hae Beom Lee

    Abstract: Scaling has been a major driver of recent advancements in deep learning. Numerous empirical studies have found that scaling laws often follow the power-law and proposed several variants of power-law functions to predict the scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications invol…

    Submitted 15 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted to ICML 2025

  21. arXiv:2505.20211  [pdf, other]

    cs.LG cs.AI

    Parameter-Efficient Fine-Tuning with Column Space Projection

    Authors: Junseo Hwang, Wonguk Cho, Taesup Kim

    Abstract: Fine-tuning large language models (LLMs) with minimal computational overhead is essential for efficiently adapting them to downstream tasks under resource constraints. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), facilitate this by updating only a small subset of parameters. However, recent studies show that LoRA diverges from full fine-tuning (Full FT) in it…

    Submitted 26 May, 2025; originally announced May 2025.

  22. arXiv:2505.19764  [pdf, ps, other]

    cs.LG cs.AI

    Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding

    Authors: Patara Trirat, Wonyong Jeong, Sung Ju Hwang

    Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Code will be available at https://github.com/DeepAuto-AI/agentic-predictor

  23. arXiv:2505.19602  [pdf, ps, other]

    cs.LG

    Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

    Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang

    Abstract: Visual Autoregressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction approach, which yields substantial improvements in efficiency, scalability, and zero-shot generalization. Nevertheless, the coarse-to-fine methodology inherent in VAR results in exponential growth of the KV cache during inference, causing considerable memory consumption and computationa…

    Submitted 26 May, 2025; originally announced May 2025.

  24. arXiv:2505.19197  [pdf, ps, other]

    cs.AI

    Structuring the Unstructured: A Multi-Agent System for Extracting and Querying Financial KPIs and Guidance

    Authors: Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee, Jihoon Kwon, Minjae Kim, Juneha Hwang, Minsoo Ha, Chaewoon Kim, Jaeseon Ha, Suyeol Yun, Jin Kim

    Abstract: Extracting structured and quantitative insights from unstructured financial filings is essential in investment research, yet remains time-consuming and resource-intensive. Conventional approaches in practice rely heavily on labor-intensive manual processes, limiting scalability and delaying the research workflow. In this paper, we propose an efficient and scalable method for accurately extracting…

    Submitted 26 June, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

    Comments: 7 pages, FinIR'25

  25. arXiv:2505.18111  [pdf, ps, other]

    cs.CV

    Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking

    Authors: Cheng-Yen Yang, Hsiang-Wei Huang, Pyong-Kun Kim, Chien-Kai Kuo, Jui-Wei Chang, Kwang-Ju Kim, Chung-I Huang, Jenq-Neng Hwang

    Abstract: We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first place AUC score of 89.4 on the 2024 ICPR Multi-mod…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Accepted by ICPR Multi-Modal Visual Pattern Recognition Workshop

  26. arXiv:2505.17612  [pdf, other]

    cs.CL cs.AI

    Distilling LLM Agent into Small Models with Retrieval and Code Tools

    Authors: Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang

    Abstract: Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise co…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: preprint, v1

  27. arXiv:2505.12879  [pdf, ps, other]

    stat.ML cs.LG

    Spline Dimensional Decomposition with Interpolation-based Optimal Knot Selection for Stochastic Dynamic Analysis

    Authors: Yeonsu Kim, Junhan Lee, Bingran Wang, John T. Hwang, Dongjin Lee

    Abstract: Forward uncertainty quantification in dynamical systems is challenging due to non-smooth or locally oscillating nonlinear behaviors. Spline dimensional decomposition (SDD) addresses such nonlinearity by partitioning input coordinates via knot placement, but its accuracy is highly sensitive to internal knot locations. Optimizing knots using sequential quadratic programming is effective, yet computa…

    Submitted 16 June, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: 28 pages, 15 figures

  28. arXiv:2505.12805  [pdf, other]

    cs.LG cs.AI

    FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA

    Authors: Seanie Lee, Sangwoo Park, Dong Bok Lee, Dominik Wagner, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang

    Abstract: Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in federated learning (FL). However, when combined with differentially private stochastic gradient descent (DP-SGD), LoRA faces substantial noise amplification: DP-SGD perturbs per-sample gradients, and the matrix mul…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: preprint

  29. arXiv:2505.12233  [pdf, ps, other]

    eess.IV cs.CV

    PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning

    Authors: Yeonkyung Lee, Woojung Han, Youngjun Jun, Hyeonmin Kim, Jungkyung Cho, Seong Jae Hwang

    Abstract: Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely availab…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: MICCAI2025 early accept

  30. arXiv:2505.12116  [pdf, ps, other]

    cs.CL

    A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings

    Authors: Fitsum Gaim, Hoyun Song, Huije Lee, Changgeon Ko, Eui Jun Hwang, Jong C. Park

    Abstract: Content moderation research has recently made significant advances, but still fails to serve the majority of the world's languages due to the lack of resources, leaving millions of vulnerable users to online hostility. This work presents a large-scale human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for three tasks: abusive…

    Submitted 17 May, 2025; originally announced May 2025.

    ACM Class: I.2.7

  31. arXiv:2505.11254  [pdf, ps, other]

    cs.LG

    Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

    Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang

    Abstract: The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance…

    Submitted 16 May, 2025; originally announced May 2025.

  32. arXiv:2505.09666  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    System Prompt Optimization with Meta-Learning

    Authors: Yumin Choi, Jinheon Baek, Sung Ju Hwang

    Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the sy…

    Submitted 14 May, 2025; originally announced May 2025.

  33. arXiv:2505.07675  [pdf, other]

    cs.LG cs.AI cs.CV

    Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization

    Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang

    Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-s…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 41 pages, 19 figures, preprint

  34. arXiv:2505.01583  [pdf, ps, other]

    cs.CV cs.AI

    TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

    Authors: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang

    Abstract: Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Pred…

    Submitted 2 May, 2025; originally announced May 2025.

  35. arXiv:2504.20734  [pdf, other]

    cs.CL cs.AI cs.CV cs.IR cs.LG

    UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

    Authors: Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang

    Abstract: Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In…

    Submitted 19 May, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

    Comments: Project page: https://universalrag.github.io

  36. arXiv:2504.20408  [pdf, other]

    cs.LG cs.AI math.NA physics.comp-ph

    FourierSpecNet: Neural Collision Operator Approximation Inspired by the Fourier Spectral Method for Solving the Boltzmann Equation

    Authors: Jae Yong Lee, Gwang Jae Jung, Byung Chan Lim, Hyung Ju Hwang

    Abstract: The Boltzmann equation, a fundamental model in kinetic theory, describes the evolution of particle distribution functions through a nonlinear, high-dimensional collision operator. However, its numerical solution remains computationally demanding, particularly for inelastic collisions and high-dimensional velocity domains. In this work, we propose the Fourier Neural Spectral Network (FourierSpecNet…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: 27 pages, 11 figures

    MSC Class: 68T20; 35Q20; 35B40; 82C40

  37. arXiv:2504.17219  [pdf, other]

    cs.LG cs.AI cs.CR

    Enhancing Variational Autoencoders with Smooth Robust Latent Encoding

    Authors: Hyomin Lee, Minseon Kim, Sangwon Jang, Jongheon Jeong, Sung Ju Hwang

    Abstract: Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models, as in Stable Diffusion, yet questions regarding their robustness remain largely underexplored. Although adversarial training has been an established technique for enhancing robustness in predictive models, it has been overlooked for generative models due to concerns about potential fidelity degr…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: Under review

  38. arXiv:2504.17192  [pdf, other]

    cs.CL

    Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

    Authors: Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang

    Abstract: Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent…

    Submitted 18 May, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

  39. arXiv:2504.11393  [pdf, other]

    cs.LG cs.CL

    DataDecide: How to Predict Best Pretraining Data with Small Experiments

    Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge

    Abstract: Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and eval…

    Submitted 15 April, 2025; originally announced April 2025.

  40. arXiv:2504.10861  [pdf, other]

    cs.CL

    Ai2 Scholar QA: Organized Literature Synthesis with Attribution

    Authors: Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D. Hwang, Jason Dunkleberger, Matt Latzke, Smita Rao, Jaron Lochner, Rob Evans, Rodney Kinney, Daniel S. Weld, Doug Downey, Sergey Feldman

    Abstract: Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2 Scholar QA, a free online scientific question answering application. To facilitate research, we make our entire pipeline public: as a customizable open-source Python package and interactive web app, along wit…

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: 7 pages

  41. arXiv:2504.08398  [pdf, other]

    cs.AR cs.LG

    MixDiT: Accelerating Image Diffusion Transformer Inference with Mixed-Precision MX Quantization

    Authors: Daeun Kim, Jinwoo Hwang, Changhun Oh, Jongse Park

    Abstract: Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inferencing is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address the challenge, prior work has explored quantization, but achieving low-p…

    Submitted 11 April, 2025; originally announced April 2025.

  42. arXiv:2504.02012  [pdf, other]

    cs.LG

    Instruction-Guided Autoregressive Neural Network Parameter Generation

    Authors: Soro Bedionita, Bruno Andreis, Song Chong, Sung Ju Hwang

    Abstract: Learning to generate neural network parameters conditioned on task descriptions and architecture specifications is pivotal for advancing model adaptability and transfer learning. Existing methods, especially those based on diffusion models, suffer from limited scalability to large architectures, rigidity in handling varying network depths, and disjointed parameter generation that undermines inter-la…

    Submitted 2 April, 2025; originally announced April 2025.

  43. arXiv:2503.22168  [pdf, other]

    cs.CV

    Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis

    Authors: Woojung Han, Yeonkyung Lee, Chanyoung Kim, Kwanghyun Park, Seong Jae Hwang

    Abstract: Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while the existing methods have been continuously focusing on several challenges, such as "missing objects" and "mismatched attributes," another critical issue of "mislocate…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: CVPR2025

  44. arXiv:2503.20823  [pdf, other]

    cs.CR

    Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy

    Authors: Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, Eunho Yang

    Abstract: Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) to generalize across both language and vision tasks, LLMs and MLLMs have shown vulnerability to jailbreaking, generating textual outputs that undermine safety, ethical, and bias standards when exposed to harmful or sensitive inputs. With the recent advancement of safety alignment via preference-tuning fr…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted at CVPR2025

  45. arXiv:2503.19385  [pdf, other]

    cs.CV cs.LG

    Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

    Authors: Jaihoon Kim, Taehoon Yoon, Jisung Hwang, Minhyuk Sung

    Abstract: We propose an inference-time scaling approach for pretrained flow models. Recently, inference-time scaling has gained significant attention in LLMs and diffusion models, improving sample quality or better aligning outputs with user preferences by leveraging additional computation. For diffusion models, particle sampling has allowed more efficient scaling due to the stochasticity at intermediate de…

    Submitted 28 May, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: Project page: https://flow-inference-time-scaling.github.io/

  46. Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment

    Authors: Ghazanfar Ali, Hong-Quan Le, Junho Kim, Seoung-won Hwang, Jae-In Hwang

    Abstract: In this paper, we present the design of a multimodal interaction framework for intelligent virtual agents in wearable mixed reality environments, especially for interactive applications at museums, botanical gardens, and similar places. These places need engaging and non-repetitive digital content delivery to maximize user involvement. An intelligent virtual agent is a promising mode for both purpo…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: 6 pages, 14 Figures, Computer Animation and Social Agents (CASA 2019)

    Journal ref: CASA 2019: Proceedings of the 32nd International Conference on Computer Animation and Social Agents - Year 2019 - Pages 47 - 52

  47. arXiv:2503.18642  [pdf, other]

    eess.IV cs.CV

    Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration

    Authors: Taejin Jeong, Joohyeok Kim, Jaehoon Joo, Yeonwoo Jung, Hyeonmin Kim, Seong Jae Hwang

    Abstract: Glaucoma is an incurable ophthalmic disease that damages the optic nerve, leads to vision loss, and ranks among the leading causes of blindness worldwide. Diagnosing glaucoma typically involves fundus photography, optical coherence tomography (OCT), and visual field testing. However, the high cost of OCT often leads to reliance on fundus photography and visual field testing, both of which exhibit…

    Submitted 24 March, 2025; originally announced March 2025.

  48. arXiv:2503.14271  [pdf, other]

    cs.NI cs.MM

    Video Streaming with Kairos: An MPC-Based ABR with Streaming-Aware Throughput Prediction

    Authors: Ziyu Zhong, Mufan Liu, Le Yang, Yifan Wang, Yiling Xu, Jenq-Neng Hwang

    Abstract: In this paper, we present Kairos, a model predictive control (MPC)-based adaptive bitrate (ABR) scheme that integrates streaming-aware throughput predictions to enhance video streaming quality. Kairos features an attention-based throughput predictor with buffer-aware uncertainty control, improving prediction accuracy and adaptability to network conditions. Specifically, we introduce a multi-time a…

    Submitted 18 March, 2025; originally announced March 2025.

  49. arXiv:2503.12524  [pdf, other]

    cs.CL cs.AI

    EXAONE Deep: Reasoning Enhanced Language Models

    Authors: LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, et al. (7 additional authors not shown)

    Abstract: We present EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on the reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAO…

    Submitted 19 March, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2412.04862, arXiv:2408.03541

  50. arXiv:2503.10055  [pdf, other]

    cs.CV eess.IV

    Fourier Decomposition for Explicit Representation of 3D Point Cloud Attributes

    Authors: Donghyun Kim, Hyunah Ko, Chanyoung Kim, Seong Jae Hwang

    Abstract: While 3D point clouds are widely utilized across various vision applications, their irregular and sparse nature makes them challenging to handle. In response, numerous encoding approaches have been proposed to capture the rich semantic information of point clouds. Yet, a critical limitation persists: a lack of consideration for colored point clouds, which are more capable 3D representations as they…

    Submitted 13 March, 2025; originally announced March 2025.