Search | arXiv e-print repository

arXiv:2507.04166 [pdf]

Governance and Technological Challenge in Digital Solidarity Economies: A Case Study of a Collaborative Transportation Platform in South Korea

Authors: Jeongone Seo, Tawfiq Ammari

Abstract: South Korea's City P illustrates how lofty goals of digital solidarity can falter when challenged by local governance realities. Drawing on Hansmann's ownership theory, collaborative governance concepts, and platform cooperativism, we conducted a qualitative case study involving policy documents, independent assessments, and 11 in-depth interviews with residents, officials, and technology develope… ▽ More South Korea's City P illustrates how lofty goals of digital solidarity can falter when challenged by local governance realities. Drawing on Hansmann's ownership theory, collaborative governance concepts, and platform cooperativism, we conducted a qualitative case study involving policy documents, independent assessments, and 11 in-depth interviews with residents, officials, and technology developers. Findings reveal a marked disconnect between the initiative's stated emphasis on community co-ownership and the actual power dynamics that largely favored government agencies and external firms. Although blockchain and integrated digital tools were meant to enhance transparency and inclusivity, stakeholders--especially elderly residents--experienced confusion and mistrust. We argue that genuine collaboration in digital solidarity economies requires not only robust technical designs but also culturally resonant ownership structures, substantive inclusion of local voices, and transparent governance mechanisms. The City P case underscores the necessity of addressing heterogeneous digital capacities, aligning funding and incentives with grassroots empowerment, and mitigating performative participation to ensure meaningful and sustainable outcomes in community-based digital innovation. △ Less

Submitted 5 July, 2025; originally announced July 2025.

Comments: 31 pages, 3 figures, under journal review

arXiv:2507.02225 [pdf, ps, other]

Metric Design != Metric Behavior: Improving Metric Selection for the Unbiased Evaluation of Dimensionality Reduction

Authors: Jiyeon Bae, Hyeon Jeon, Jinwook Seo

Abstract: Evaluating the accuracy of dimensionality reduction (DR) projections in preserving the structure of high-dimensional data is crucial for reliable visual analytics. Diverse evaluation metrics targeting different structural characteristics have thus been developed. However, evaluations of DR projections can become biased if highly correlated metrics--those measuring similar structural characteristic… ▽ More Evaluating the accuracy of dimensionality reduction (DR) projections in preserving the structure of high-dimensional data is crucial for reliable visual analytics. Diverse evaluation metrics targeting different structural characteristics have thus been developed. However, evaluations of DR projections can become biased if highly correlated metrics--those measuring similar structural characteristics--are inadvertently selected, favoring DR techniques that emphasize those characteristics. To address this issue, we propose a novel workflow that reduces bias in the selection of evaluation metrics by clustering metrics based on their empirical correlations rather than on their intended design characteristics alone. Our workflow works by computing metric similarity using pairwise correlations, clustering metrics to minimize overlap, and selecting a representative metric from each cluster. Quantitative experiments demonstrate that our approach improves the stability of DR evaluation, which indicates that our workflow contributes to mitigating evaluation bias. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: IEEE VIS 2025 (short paper)

arXiv:2506.23102 [pdf, ps, other]

MedRegion-CT: Region-Focused Multimodal LLM for Comprehensive 3D CT Report Generation

Authors: Sunggu Kyung, Jinyoung Seo, Hyunseok Lim, Dongyeong Kim, Hyungbin Park, Jimin Sung, Jihyun Kim, Wooyoung Jo, Yoojin Nam, Namkug Kim

Abstract: The recent release of RadGenome-Chest CT has significantly advanced CT-based report generation. However, existing methods primarily focus on global features, making it challenging to capture region-specific details, which may cause certain abnormalities to go unnoticed. To address this, we propose MedRegion-CT, a region-focused Multi-Modal Large Language Model (MLLM) framework, featuring three key… ▽ More The recent release of RadGenome-Chest CT has significantly advanced CT-based report generation. However, existing methods primarily focus on global features, making it challenging to capture region-specific details, which may cause certain abnormalities to go unnoticed. To address this, we propose MedRegion-CT, a region-focused Multi-Modal Large Language Model (MLLM) framework, featuring three key innovations. First, we introduce Region Representative ($R^2$) Token Pooling, which utilizes a 2D-wise pretrained vision model to efficiently extract 3D CT features. This approach generates global tokens representing overall slice features and region tokens highlighting target areas, enabling the MLLM to process comprehensive information effectively. Second, a universal segmentation model generates pseudo-masks, which are then processed by a mask encoder to extract region-centric features. This allows the MLLM to focus on clinically relevant regions, using six predefined region masks. Third, we leverage segmentation results to extract patient-specific attributions, including organ size, diameter, and locations. These are converted into text prompts, enriching the MLLM's understanding of patient-specific contexts. To ensure rigorous evaluation, we conducted benchmark experiments on report generation using the RadGenome-Chest CT. MedRegion-CT achieved state-of-the-art performance, outperforming existing methods in natural language generation quality and clinical relevance while maintaining interpretability. The code for our framework is publicly available. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: 14 pages, 5 figures, submitted to ICCV 2025

arXiv:2506.21556 [pdf, ps, other]

VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Authors: Hyeongcheol Park, MinHyuk Jang, Ha Dam Baek, Gyusam Chang, Jiyoung Seo, Jiwan Park, Hogun Park, Sangpil Kim

Abstract: Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowled… ▽ More Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: Project Page: https://vatkg.github.io/

arXiv:2506.19217 [pdf, ps, other]

MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports

Authors: Sunggu Kyung, Hyungbin Park, Jinyoung Seo, Jimin Sung, Jihyun Kim, Dongyeong Kim, Wooyoung Jo, Yoojin Nam, Sangah Park, Taehee Kwon, Sang Min Lee, Namkug Kim

Abstract: Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question an… ▽ More Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels: classification, detection, and correction. Using this benchmark, we quantitatively assess the performance of state-of-the-art 3D medical MLLMs, revealing substantial variation in their capabilities across different error types. Our benchmark contributes to the development of more reliable and clinically applicable MLLMs, ultimately helping reduce diagnostic errors and improve accuracy in clinical practice. The code and datasets are available at https://github.com/babbu3682/MedErr-CT. △ Less

Submitted 23 June, 2025; originally announced June 2025.

Comments: 14 pages, 5 figures, submitted to CVPR 2025

arXiv:2506.14820 [pdf, other]

doi 10.2312/evs.20251087

Navigating High-Dimensional Backstage: A Guide for Exploring Literature for the Reliable Use of Dimensionality Reduction

Authors: Hyeon Jeon, Hyunwook Lee, Yun-Hsin Kuo, Taehyun Yang, Daniel Archambault, Sungahn Ko, Takanori Fujiwara, Kwan-Liu Ma, Jinwook Seo

Abstract: Visual analytics using dimensionality reduction (DR) can easily be unreliable for various reasons, e.g., inherent distortions in representing the original data. The literature has thus proposed a wide range of methodologies to make DR-based visual analytics reliable. However, the diversity and extensiveness of the literature can leave novice analysts and researchers uncertain about where to begin… ▽ More Visual analytics using dimensionality reduction (DR) can easily be unreliable for various reasons, e.g., inherent distortions in representing the original data. The literature has thus proposed a wide range of methodologies to make DR-based visual analytics reliable. However, the diversity and extensiveness of the literature can leave novice analysts and researchers uncertain about where to begin and proceed. To address this problem, we propose a guide for reading papers for reliable visual analytics with DR. Relying on the previous classification of the relevant literature, our guide helps both practitioners to (1) assess their current DR expertise and (2) identify papers that will further enhance their understanding. Interview studies with three experts in DR and data visualizations validate the significance, comprehensiveness, and usefulness of our guide. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: EG/VGTC EuroVis 2025 Short paper

arXiv:2506.13697 [pdf, ps, other]

Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry

Authors: Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, Seungryong Kim, Yuki Mitsufuji

Abstract: We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synt… ▽ More We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synthesis cannot handle in-the-wild videos. Our approach consists of two steps: estimating temporally consistent geometry, and generative rendering guided by this geometry. By integrating geometric priors, the generative model focuses on synthesizing realistic details where the estimated geometry is uncertain. We eliminate the need for extensive 4D training data through a factorized fine-tuning framework that separately trains spatial and temporal components using multi-view image and video data. Our method outperforms baselines in producing plausible videos from novel camera trajectories, especially in extreme extrapolation scenarios on real-world footage. △ Less

Submitted 16 June, 2025; originally announced June 2025.

Comments: Our project page can be found at https://cvlab-kaist.github.io/Vid-CamEdit/

arXiv:2506.10197 [pdf]

Intergenerational AI Literacy in Korean Immigrant Families: Interpretive Gatekeeping Meets Convenient Critical Deferment

Authors: Jeongone Seo, Ryan Womack, Tawfiq Ammari

Abstract: As artificial intelligence (AI) becomes deeply integrated into family life, immigrant families must navigate unique intergenerational, linguistic, and cultural challenges. This study examines how Korean immigrant families in the United States negotiate the use of AI tools such as ChatGPT and smart assistants in their homes. Through 20 semi-structured interviews with parents and teens, we identify… ▽ More As artificial intelligence (AI) becomes deeply integrated into family life, immigrant families must navigate unique intergenerational, linguistic, and cultural challenges. This study examines how Korean immigrant families in the United States negotiate the use of AI tools such as ChatGPT and smart assistants in their homes. Through 20 semi-structured interviews with parents and teens, we identify two key practices that shape their engagement: interpretive gatekeeping, where parents mediate their children's AI use through a lens of cultural and ethical values, and convenient critical deferment, where teens strategically postpone critical evaluation of AI for immediate academic and social utility. These intertwined practices challenge conventional, skills-based models of AI literacy, revealing it instead as a dynamic and relational practice co-constructed through ongoing family negotiation. We contribute to information science and HCI by offering a new conceptual extension for intergenerational AI literacy and providing design implications for more equitable, culturally attuned, and family-centered AI systems. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.08725 [pdf]

Stop Misusing t-SNE and UMAP for Visual Analytics

Authors: Hyeon Jeon, Jeongin Park, Sungbok Shin, Jinwook Seo

Abstract: Misuses of t-SNE and UMAP in visual analytics have become increasingly common. For example, although t-SNE and UMAP projections often do not faithfully reflect true distances between clusters, practitioners frequently use them to investigate inter-cluster relationships. In this paper, we bring this issue to the surface and comprehensively investigate why such misuse occurs and how to prevent it. W… ▽ More Misuses of t-SNE and UMAP in visual analytics have become increasingly common. For example, although t-SNE and UMAP projections often do not faithfully reflect true distances between clusters, practitioners frequently use them to investigate inter-cluster relationships. In this paper, we bring this issue to the surface and comprehensively investigate why such misuse occurs and how to prevent it. We conduct a literature review of 114 papers to verify the prevalence of the misuse and analyze the reasonings behind it. We then execute an interview study to uncover practitioners' implicit motivations for using these techniques -- rationales often undisclosed in the literature. Our findings indicate that misuse of t-SNE and UMAP primarily stems from limited discourse on their appropriate use in visual analytics. We conclude by proposing future directions and concrete action items to promote more reasonable use of DR. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 9 pages

arXiv:2506.08644 [pdf, other]

Semi-gradient DICE for Offline Constrained Reinforcement Learning

Authors: Woosung Kim, JunHo Seo, Jongmin Lee, Byung-Jun Lee

Abstract: Stationary Distribution Correction Estimation (DICE) addresses the mismatch between the stationary distribution induced by a policy and the target distribution required for reliable off-policy evaluation (OPE) and policy optimization. DICE-based offline constrained RL particularly benefits from the flexibility of DICE, as it simultaneously maximizes return while estimating costs in offline setting… ▽ More Stationary Distribution Correction Estimation (DICE) addresses the mismatch between the stationary distribution induced by a policy and the target distribution required for reliable off-policy evaluation (OPE) and policy optimization. DICE-based offline constrained RL particularly benefits from the flexibility of DICE, as it simultaneously maximizes return while estimating costs in offline settings. However, we have observed that recent approaches designed to enhance the offline RL performance of the DICE framework inadvertently undermine its ability to perform OPE, making them unsuitable for constrained RL scenarios. In this paper, we identify the root cause of this limitation: their reliance on a semi-gradient optimization, which solves a fundamentally different optimization problem and results in failures in cost estimation. Building on these insights, we propose a novel method to enable OPE and constrained RL through semi-gradient DICE. Our method ensures accurate cost estimation and achieves state-of-the-art performance on the offline constrained RL benchmark, DSRL. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: Constrained Offline Reinforcement Learning

arXiv:2506.04283 [pdf, ps, other]

SSIMBaD: Sigma Scaling with SSIM-Guided Balanced Diffusion for AnimeFace Colorization

Authors: Junpyo Seo, Hanbin Koo, Jieun Yook, Byung-Ro Moon

Abstract: We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches. Our method preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Unlike traditional approaches that rely on predefined noise schedules - which often compromise perceptual consistency -- our framework builds on continuous… ▽ More We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches. Our method preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Unlike traditional approaches that rely on predefined noise schedules - which often compromise perceptual consistency -- our framework builds on continuous-time diffusion models and introduces SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion). SSIMBaD applies a sigma-space transformation that aligns perceptual degradation, as measured by structural similarity (SSIM), in a linear manner. This scaling ensures uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions. Experiments on a large-scale Anime face dataset demonstrate that our method outperforms state-of-the-art models in both pixel accuracy and perceptual quality, while generalizing to diverse styles. Code is available at github.com/Giventicket/SSIMBaD-Sigma-Scaling-with-SSIM-Guided-Balanced-Diffusion-for-AnimeFace-Colorization △ Less

Submitted 4 June, 2025; originally announced June 2025.

Comments: 10 pages, rest of the pages are appendix

arXiv:2505.23806 [pdf, ps, other]

MedOrchestra: A Hybrid Cloud-Local LLM Approach for Clinical Data Interpretation

Authors: Sihyeon Lee, Hyunjoo Song, Jong-chan Lee, Yoon Jin Lee, Boram Lee, Hee-Eon Lim, Dongyeong Kim, Jinwook Seo, Bohyoung Kim

Abstract: Deploying large language models (LLMs) in clinical settings faces critical trade-offs: cloud LLMs, with their extensive parameters and superior performance, pose risks to sensitive clinical data privacy, while local LLMs preserve privacy but often fail at complex clinical interpretation tasks. We propose MedOrchestra, a hybrid framework where a cloud LLM decomposes complex clinical tasks into mana… ▽ More Deploying large language models (LLMs) in clinical settings faces critical trade-offs: cloud LLMs, with their extensive parameters and superior performance, pose risks to sensitive clinical data privacy, while local LLMs preserve privacy but often fail at complex clinical interpretation tasks. We propose MedOrchestra, a hybrid framework where a cloud LLM decomposes complex clinical tasks into manageable subtasks and prompt generation, while a local LLM executes these subtasks in a privacy-preserving manner. Without accessing clinical data, the cloud LLM generates and validates subtask prompts using clinical guidelines and synthetic test cases. The local LLM executes subtasks locally and synthesizes outputs generated by the cloud LLM. We evaluate MedOrchestra on pancreatic cancer staging using 100 radiology reports under NCCN guidelines. On free-text reports, MedOrchestra achieves 70.21% accuracy, outperforming local model baselines (without guideline: 48.94%, with guideline: 56.59%) and board-certified clinicians (gastroenterologists: 59.57%, surgeons: 65.96%, radiologists: 55.32%). On structured reports, MedOrchestra reaches 85.42% accuracy, showing clear superiority across all settings. △ Less

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.23400 [pdf, ps, other]

Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

Authors: Sanggyun Ma, Wonjoon Choi, Jihun Park, Jaeyeul Kim, Seunghun Lee, Jiwan Seo, Sunghoon Im

Abstract: We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling techniqu… ▽ More We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pre-trained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model's ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects. △ Less

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.21467 [pdf, ps, other]

Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Authors: Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

Abstract: Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized Autoregressive (AR) Models (e.g., Qwen2.5 7B, Llama3 8B), their iterat… ▽ More Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized Autoregressive (AR) Models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. For the first time, diffusion language models achieve a comparable and even faster latency as the widely adopted autoregressive models. Our work successfully paved the way for scaling up the diffusion language model to a broader scope of applications across different domains. △ Less

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.18326 [pdf, ps, other]

Pragmatic Disengagement and Culturally Situated Non Use Older Korean Immigrants Strategies for Navigating Digital Noise

Authors: Jeongone Seo, Tawfiq Ammari

Abstract: Older immigrant adults often face layered barriers to digital participation, including language exclusion, generational divides, and emotional fatigue. This study examines how older Korean immigrants in the greater NYC area selectively engage with digital tools such as smartphones, YouTube, and AI platforms. Using a community-based participatory research (CBPR) framework and 22 semi-structured int… ▽ More Older immigrant adults often face layered barriers to digital participation, including language exclusion, generational divides, and emotional fatigue. This study examines how older Korean immigrants in the greater NYC area selectively engage with digital tools such as smartphones, YouTube, and AI platforms. Using a community-based participatory research (CBPR) framework and 22 semi-structured interviews, we identify two key practices: pragmatic disengagement, where users avoid emotionally taxing or culturally misaligned content, and interdependent navigation, where digital use is shaped through reliance on family or community support. These strategies challenge deficit-oriented narratives of non-use, showing how disengagement can be thoughtful, protective, and culturally situated. We contribute to CSCW by expanding theories of non-use and algorithmic resistance and by offering design and policy recommendations to support more dignified, culturally attuned digital engagement for aging immigrant populations. △ Less

Submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.00779 [pdf, other]

Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures

Authors: Junwon Seo, Kensuke Nakamura, Andrea Bajcsy

Abstract: Recent advances in generative world models have enabled classical safe control methods, such as Hamilton-Jacobi (HJ) reachability, to generalize to complex robotic systems operating directly from high-dimensional sensor observations. However, obtaining comprehensive coverage of all safety-critical scenarios during world model training is extremely challenging. As a result, latent safety filters bu… ▽ More Recent advances in generative world models have enabled classical safe control methods, such as Hamilton-Jacobi (HJ) reachability, to generalize to complex robotic systems operating directly from high-dimensional sensor observations. However, obtaining comprehensive coverage of all safety-critical scenarios during world model training is extremely challenging. As a result, latent safety filters built on top of these models may miss novel hazards and even fail to prevent known ones, overconfidently misclassifying risky out-of-distribution (OOD) situations as safe. To address this, we introduce an uncertainty-aware latent safety filter that proactively steers robots away from both known and unseen failures. Our key idea is to use the world model's epistemic uncertainty as a proxy for identifying unseen potential hazards. We propose a principled method to detect OOD world model predictions by calibrating an uncertainty threshold via conformal prediction. By performing reachability analysis in an augmented state space-spanning both the latent representation and the epistemic uncertainty-we synthesize a latent safety filter that can reliably safeguard arbitrary policies from both known and unseen safety hazards. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our uncertainty-aware safety filter preemptively detects potential unsafe scenarios and reliably proposes safe, in-distribution actions. Video results can be found on the project website at https://cmu-intentlab.github.io/UNISafe △ Less

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.17080 [pdf, other]

Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators

Authors: Joohwan Seo, Nikhil Potu Surya Prakash, Soomi Lee, Arvind Kruthiventy, Megan Teng, Jongeun Choi, Roberto Horowitz

Abstract: In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using… ▽ More In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using a differential geometric perspective. As in the case of the UFIC, the GUFIC utilizes energy tank augmentation for both force-tracking and impedance control to guarantee the manipulator's passivity relative to external forces. This ensures that the end effector maintains safe contact interaction with uncertain environments and tracks a desired interaction force. Moreover, we resolve a non-causal implementation problem in the UFIC formulation by introducing velocity and force fields. Due to its formulation on SE(3), the proposed GUFIC inherits the desirable SE(3) invariance and equivariance properties of the GIC, which helps increase sample efficiency in machine learning applications where a learning algorithm is incorporated into the control law. The proposed control law is validated in a simulation environment under scenarios requiring tracking an SE(3) trajectory, incorporating both position and orientation, while exerting a force on a surface. The codes are available at https://github.com/Joohwan-Seo/GUFIC_mujoco. △ Less

Submitted 23 April, 2025; originally announced April 2025.

Comments: Submitted to Control Decision Conference (CDC) 2025

arXiv:2504.09859 [pdf, other]

Can VLMs Assess Similarity Between Graph Visualizations?

Authors: Seokweon Jung, Hyeon Jeon, Jeongmin Rhee, Jinwook Seo

Abstract: Graph visualizations have been studied for tasks such as clustering and temporal analysis, but how these visual similarities relate to established graph similarity measures remains unclear. In this paper, we explore the potential of Vision Language Models (VLMs) to approximate human-like perception of graph similarity. We generate graph datasets of various sizes and densities and compare VLM-deriv… ▽ More Graph visualizations have been studied for tasks such as clustering and temporal analysis, but how these visual similarities relate to established graph similarity measures remains unclear. In this paper, we explore the potential of Vision Language Models (VLMs) to approximate human-like perception of graph similarity. We generate graph datasets of various sizes and densities and compare VLM-derived visual similarity scores with feature-based measures. Our findings indicate VLMs can assess graph similarity in a manner similar to feature-based measures, even though differences among the measures exist. In future work, we plan to extend our research by conducting experiments on human visual graph perception. △ Less

Submitted 14 April, 2025; originally announced April 2025.

arXiv:2504.06264 [pdf, other]

D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes

Authors: Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim

Abstract: We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera pose… ▽ More We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose D^2USt3R that regresses 4D pointmaps that simultaneiously capture both static and dynamic 3D scene geometry in a feed-forward manner. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates spatio-temporal dense correspondence to the proposed 4D pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior reconstruction performance across various datasets featuring complex motions. △ Less

Submitted 8 April, 2025; originally announced April 2025.

Comments: project page: https://cvlab-kaist.github.io/DDUSt3R/

arXiv:2504.03979 [pdf, other]

Structured Extraction of Process Structure Properties Relationships in Materials Science

Authors: Amit K Verma, Zhisong Zhang, Junwon Seo, Robin Kuo, Runbo Jiang, Emma Strubell, Anthony D Rollett

Abstract: With the advent of large language models (LLMs), the vast unstructured text within millions of academic papers is increasingly accessible for materials discovery, although significant challenges remain. While LLMs offer promising few- and zero-shot learning capabilities, particularly valuable in the materials domain where expert annotations are scarce, general-purpose LLMs often fail to address ke… ▽ More With the advent of large language models (LLMs), the vast unstructured text within millions of academic papers is increasingly accessible for materials discovery, although significant challenges remain. While LLMs offer promising few- and zero-shot learning capabilities, particularly valuable in the materials domain where expert annotations are scarce, general-purpose LLMs often fail to address key materials-specific queries without further adaptation. To bridge this gap, fine-tuning LLMs on human-labeled data is essential for effective structured knowledge extraction. In this study, we introduce a novel annotation schema designed to extract generic process-structure-properties relationships from scientific literature. We demonstrate the utility of this approach using a dataset of 128 abstracts, with annotations drawn from two distinct domains: high-temperature materials (Domain I) and uncertainty quantification in simulating materials microstructure (Domain II). Initially, we developed a conditional random field (CRF) model based on MatBERT, a domain-specific BERT variant, and evaluated its performance on Domain I. Subsequently, we compared this model with a fine-tuned LLM (GPT-4o from OpenAI) under identical conditions. Our results indicate that fine-tuning LLMs can significantly improve entity extraction performance over the BERT-CRF baseline on Domain I. However, when additional examples from Domain II were incorporated, the performance of the BERT-CRF model became comparable to that of the GPT-4o model. These findings underscore the potential of our schema for structured knowledge extraction and highlight the complementary strengths of both modeling approaches. △ Less

Submitted 4 April, 2025; originally announced April 2025.

Comments: 16 pages, 3 figures, 13 table

arXiv:2503.13180 [pdf, other]

GC-Fed: Gradient Centralized Federated Learning with Partial Client Participation

Authors: Jungwon Seo, Ferhat Ozgur Catak, Chunming Rong, Kibeom Hong, Minhoe Kim

Abstract: Federated Learning (FL) enables privacy-preserving multi-source information fusion (MSIF) but is challenged by client drift in highly heterogeneous data settings. Many existing drift-mitigation strategies rely on reference-based techniques--such as gradient adjustments or proximal loss--that use historical snapshots (e.g., past gradients or previous global models) as reference points. When only a… ▽ More Federated Learning (FL) enables privacy-preserving multi-source information fusion (MSIF) but is challenged by client drift in highly heterogeneous data settings. Many existing drift-mitigation strategies rely on reference-based techniques--such as gradient adjustments or proximal loss--that use historical snapshots (e.g., past gradients or previous global models) as reference points. When only a subset of clients participates in each training round, these historical references may not accurately capture the overall data distribution, leading to unstable training. In contrast, our proposed Gradient Centralized Federated Learning (GC-Fed) employs a hyperplane as a historically independent reference point to guide local training and enhance inter-client alignment. GC-Fed comprises two complementary components: Local GC, which centralizes gradients during local training, and Global GC, which centralizes updates during server aggregation. In our hybrid design, Local GC is applied to feature-extraction layers to harmonize client contributions, while Global GC refines classifier layers to stabilize round-wise performance. Theoretical analysis and extensive experiments on benchmark FL tasks demonstrate that GC-Fed effectively mitigates client drift and achieves up to a 20% improvement in accuracy under heterogeneous and partial participation conditions. △ Less

Submitted 20 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.12806 [pdf, other]

AV-Surf: Surface-Enhanced Geometry-Aware Novel-View Acoustic Synthesis

Authors: Hadam Baek, Hannie Shin, Jiyoung Seo, Chanwoo Kim, Saerom Kim, Hyeongbok Kim, Sangpil Kim

Abstract: Accurately modeling sound propagation with complex real-world environments is essential for Novel View Acoustic Synthesis (NVAS). While previous studies have leveraged visual perception to estimate spatial acoustics, the combined use of surface normal and structural details from 3D representations in acoustic modeling has been underexplored. Given their direct impact on sound wave reflections and… ▽ More Accurately modeling sound propagation with complex real-world environments is essential for Novel View Acoustic Synthesis (NVAS). While previous studies have leveraged visual perception to estimate spatial acoustics, the combined use of surface normal and structural details from 3D representations in acoustic modeling has been underexplored. Given their direct impact on sound wave reflections and propagation, surface normals should be jointly modeled with structural details to achieve accurate spatial acoustics. In this paper, we propose a surface-enhanced geometry-aware approach for NVAS to improve spatial acoustic modeling. To achieve this, we exploit geometric priors, such as image, depth map, surface normals, and point clouds obtained using a 3D Gaussian Splatting (3DGS) based framework. We introduce a dual cross-attention-based transformer integrating geometrical constraints into frequency query to understand the surroundings of the emitter. Additionally, we design a ConvNeXt-based spectral features processing network called Spectral Refinement Network (SRN) to synthesize realistic binaural audio. Experimental results on the RWAVS and SoundSpace datasets highlight the necessity of our approach, as it surpasses existing methods in novel view acoustic synthesis. △ Less

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.09829 [pdf, other]

SE(3)-Equivariant Robot Learning and Control: A Tutorial Survey

Authors: Joohwan Seo, Soochul Yoo, Junwoo Chang, Hyunseok An, Hyunwoo Ryu, Soomi Lee, Arvind Kruthiventy, Jongeun Choi, Roberto Horowitz

Abstract: Recent advances in deep learning and Transformers have driven major breakthroughs in robotics by employing techniques such as imitation learning, reinforcement learning, and LLM-based multimodal perception and decision-making. However, conventional deep learning and Transformer models often struggle to process data with inherent symmetries and invariances, typically relying on large datasets or ex… ▽ More Recent advances in deep learning and Transformers have driven major breakthroughs in robotics by employing techniques such as imitation learning, reinforcement learning, and LLM-based multimodal perception and decision-making. However, conventional deep learning and Transformer models often struggle to process data with inherent symmetries and invariances, typically relying on large datasets or extensive data augmentation. Equivariant neural networks overcome these limitations by explicitly integrating symmetry and invariance into their architectures, leading to improved efficiency and generalization. This tutorial survey reviews a wide range of equivariant deep learning and control methods for robotics, from classic to state-of-the-art, with a focus on SE(3)-equivariant models that leverage the natural 3D rotational and translational symmetries in visual robotic manipulation and control design. Using unified mathematical notation, we begin by reviewing key concepts from group theory, along with matrix Lie groups and Lie algebras. We then introduce foundational group-equivariant neural network design and show how the group-equivariance can be obtained through their structure. Next, we discuss the applications of SE(3)-equivariant neural networks in robotics in terms of imitation learning and reinforcement learning. The SE(3)-equivariant control design is also reviewed from the perspective of geometric control. Finally, we highlight the challenges and future directions of equivariant methods in developing more robust, sample-efficient, and multi-modal real-world robotic systems. △ Less

Submitted 23 April, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

Comments: Accepted to International Journcal of Control, Automation and Systems (IJCAS)

arXiv:2503.07415 [pdf, other]

doi 10.1145/3706599.3719817

"Sighted People Have Their Pick Of The Litter": Unpacking The Need For Digital Mental Health (DMH) Tracking Services With And For The Blind Community

Authors: Omar Khan, JooYoung Seo

Abstract: The proliferation of digital mental health (DMH) tracking services promises personalized support, yet accessibility barriers limit equal access. This study investigates blind community experiences with DMH tracking services across the United States as a step toward inclusive health technology design. Working with blind advocacy organizations, we distributed a cross-sectional observational survey (… ▽ More The proliferation of digital mental health (DMH) tracking services promises personalized support, yet accessibility barriers limit equal access. This study investigates blind community experiences with DMH tracking services across the United States as a step toward inclusive health technology design. Working with blind advocacy organizations, we distributed a cross-sectional observational survey (n = 93) and analyzed open-ended responses using Norman and Skinner's eHealth Literacy framework. Our findings reveal significant challenges in navigation, content interpretation, and overall user experience, which impede the blind community's effective engagement with DMH tools. Results highlight the need for adaptive interfaces, accessible tracking strategies, and voice-guided interactions. These insights inform design recommendations for developers and policymakers, promoting more inclusive mental health technologies. By prioritizing accessibility, we make forward progress in ensuring that DMH tracking services fulfill their potential to support mental well-being across diverse user groups, fostering digital equality in mental health care. △ Less

Submitted 10 March, 2025; originally announced March 2025.

Comments: Accepted to CHI 2025

arXiv:2503.06491 [pdf, other]

MoFE: Mixture of Frozen Experts Architecture

Authors: Jean Seo, Jaeyoon Kim, Hyopil Shin

Abstract: We propose the Mixture of Frozen Experts (MoFE) architecture, which integrates Parameter-efficient Fine-tuning (PEFT) and the Mixture of Experts (MoE) architecture to enhance both training efficiency and model scalability. By freezing the Feed Forward Network (FFN) layers within the MoE framework, MoFE significantly reduces the number of trainable parameters, improving training efficiency while st… ▽ More We propose the Mixture of Frozen Experts (MoFE) architecture, which integrates Parameter-efficient Fine-tuning (PEFT) and the Mixture of Experts (MoE) architecture to enhance both training efficiency and model scalability. By freezing the Feed Forward Network (FFN) layers within the MoE framework, MoFE significantly reduces the number of trainable parameters, improving training efficiency while still allowing for effective knowledge transfer from the expert models. This facilitates the creation of models proficient in multiple domains. We conduct experiments to evaluate the trade-offs between performance and efficiency, compare MoFE with other PEFT methodologies, assess the impact of domain expertise in the constituent models, and determine the optimal training strategy. The results show that, although there may be some trade-offs in performance, the efficiency gains are substantial, making MoFE a reasonable solution for real-world, resource-constrained environments. △ Less

Submitted 9 March, 2025; originally announced March 2025.

Comments: NAACL 2025 Industry

arXiv:2503.06413 [pdf, other]

Swift Hydra: Self-Reinforcing Generative Framework for Anomaly Detection with Multiple Mamba Models

Authors: Nguyen Do, Truc Nguyen, Malik Hassanaly, Raed Alharbi, Jung Taek Seo, My T. Thai

Abstract: Despite a plethora of anomaly detection models developed over the years, their ability to generalize to unseen anomalies remains an issue, particularly in critical systems. This paper aims to address this challenge by introducing Swift Hydra, a new framework for training an anomaly detection method based on generative AI and reinforcement learning (RL). Through featuring an RL policy that operates… ▽ More Despite a plethora of anomaly detection models developed over the years, their ability to generalize to unseen anomalies remains an issue, particularly in critical systems. This paper aims to address this challenge by introducing Swift Hydra, a new framework for training an anomaly detection method based on generative AI and reinforcement learning (RL). Through featuring an RL policy that operates on the latent variables of a generative model, the framework synthesizes novel and diverse anomaly samples that are capable of bypassing a detection model. These generated synthetic samples are, in turn, used to augment the detection model, further improving its ability to handle challenging anomalies. Swift Hydra also incorporates Mamba models structured as a Mixture of Experts (MoE) to enable scalable adaptation of the number of Mamba experts based on data complexity, effectively capturing diverse feature distributions without increasing the model's inference time. Empirical evaluations on ADBench benchmark demonstrate that Swift Hydra outperforms other state-of-the-art anomaly detection models while maintaining a relatively short inference time. From these results, our research highlights a new and auspicious paradigm of integrating RL and generative AI for advancing anomaly detection. △ Less

Submitted 24 March, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

Journal ref: In The Thirteenth International Conference on Learning Representations, 2025

arXiv:2503.04807 [pdf, other]

Call for Rigor in Reporting Quality of Instruction Tuning Data

Authors: Hyeonseok Moon, Jaehyung Seo, Heuiseok Lim

Abstract: Instruction tuning is crucial for adapting large language models (LLMs) to align with user intentions. Numerous studies emphasize the significance of the quality of instruction tuning (IT) data, revealing a strong correlation between IT data quality and the alignment performance of LLMs. In these studies, the quality of IT data is typically assessed by evaluating the performance of LLMs trained wi… ▽ More Instruction tuning is crucial for adapting large language models (LLMs) to align with user intentions. Numerous studies emphasize the significance of the quality of instruction tuning (IT) data, revealing a strong correlation between IT data quality and the alignment performance of LLMs. In these studies, the quality of IT data is typically assessed by evaluating the performance of LLMs trained with that data. However, we identified a prevalent issue in such practice: hyperparameters for training models are often selected arbitrarily without adequate justification. We observed significant variations in hyperparameters applied across different studies, even when training the same model with the same data. In this study, we demonstrate the potential problems arising from this practice and emphasize the need for careful consideration in verifying data quality. Through our experiments on the quality of LIMA data and a selected set of 1,000 Alpaca data points, we demonstrate that arbitrary hyperparameter decisions can make any arbitrary conclusion. △ Less

Submitted 15 May, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

Comments: Accepted to the ACL2025-main

arXiv:2503.01097 [pdf, other]

Measuring the Validity of Clustering Validation Datasets

Authors: Hyeon Jeon, Michaël Aupetit, DongHwa Shin, Aeri Cho, Seokhyeon Park, Jinwook Seo

Abstract: Clustering techniques are often validated using benchmark datasets where class labels are used as ground-truth clusters. However, depending on the datasets, class labels may not align with the actual data clusters, and such misalignment hampers accurate validation. Therefore, it is essential to evaluate and compare datasets regarding their cluster-label matching (CLM), i.e., how well their class l… ▽ More Clustering techniques are often validated using benchmark datasets where class labels are used as ground-truth clusters. However, depending on the datasets, class labels may not align with the actual data clusters, and such misalignment hampers accurate validation. Therefore, it is essential to evaluate and compare datasets regarding their cluster-label matching (CLM), i.e., how well their class labels match actual clusters. Internal validation measures (IVMs), like Silhouette, can compare CLM over different labeling of the same dataset, but are not designed to do so across different datasets. We thus introduce Adjusted IVMs as fast and reliable methods to evaluate and compare CLM across datasets. We establish four axioms that require validation measures to be independent of data properties not related to cluster structure (e.g., dimensionality, dataset size). Then, we develop standardized protocols to convert any IVM to satisfy these axioms, and use these protocols to adjust six widely used IVMs. Quantitative experiments (1) verify the necessity and effectiveness of our protocols and (2) show that adjusted IVMs outperform the competitors, including standard IVMs, in accurately evaluating CLM both within and across datasets. We also show that the datasets can be filtered or improved using our method to form more reliable benchmarks for clustering validation. △ Less

Submitted 2 March, 2025; originally announced March 2025.

Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

arXiv:2502.15826 [pdf, other]

CoME: An Unlearning-based Approach to Conflict-free Model Editing

Authors: Dahyun Jung, Jaehyung Seo, Jaewook Lee, Chanjun Park, Heuiseok Lim

Abstract: Large language models (LLMs) often retain outdated or incorrect information from pre-training, which undermines their reliability. While model editing methods have been developed to address such errors without full re-training, they frequently suffer from knowledge conflicts, where outdated information interferes with new knowledge. In this work, we propose Conflict-free Model Editing (CoME), a no… ▽ More Large language models (LLMs) often retain outdated or incorrect information from pre-training, which undermines their reliability. While model editing methods have been developed to address such errors without full re-training, they frequently suffer from knowledge conflicts, where outdated information interferes with new knowledge. In this work, we propose Conflict-free Model Editing (CoME), a novel framework that enhances the accuracy of knowledge updates in LLMs by selectively removing outdated knowledge. CoME leverages unlearning to mitigate knowledge interference, allowing new information to be integrated without compromising relevant linguistic features. Through experiments on GPT-J and LLaMA-3 using Counterfact and ZsRE datasets, we demonstrate that CoME improves both editing accuracy and model reliability when applied to existing editing methods. Our results highlight that the targeted removal of outdated knowledge is crucial for enhancing model editing effectiveness and maintaining the model's generative performance. △ Less

Submitted 19 February, 2025; originally announced February 2025.

Comments: Accepted to NAACL 2025 main conference

arXiv:2502.15395 [pdf, ps, other]

Beyond Tools: Understanding How Heavy Users Integrate LLMs into Everyday Tasks and Decision-Making

Authors: Eunhye Kim, Kiroong Choe, Minju Yoo, Sadat Shams Chowdhury, Jinwook Seo

Abstract: Large language models (LLMs) are increasingly used for both everyday and specialized tasks. While HCI research focuses on domain-specific applications, little is known about how heavy users integrate LLMs into everyday decision-making. Through qualitative interviews with heavy LLM users (n=7) who employ these systems for both intuitive and analytical thinking tasks, our findings show that particip… ▽ More Large language models (LLMs) are increasingly used for both everyday and specialized tasks. While HCI research focuses on domain-specific applications, little is known about how heavy users integrate LLMs into everyday decision-making. Through qualitative interviews with heavy LLM users (n=7) who employ these systems for both intuitive and analytical thinking tasks, our findings show that participants use LLMs for social validation, self-regulation, and interpersonal guidance, seeking to build self-confidence and optimize cognitive resources. These users viewed LLMs either as rational, consistent entities or average human decision-makers. Our findings suggest that heavy LLM users develop nuanced interaction patterns beyond simple delegation, highlighting the need to reconsider how we study LLM integration in decision-making processes. △ Less

Submitted 16 April, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

arXiv:2502.12560 [pdf, other]

How does a Language-Specific Tokenizer affect LLMs?

Authors: Jean Seo, Jaeyoon Kim, SungJoo Byun, Hyopil Shin

Abstract: The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds… ▽ More The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments to compare models with the basic tokenizer and the extended tokenizer through various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce less nonsensical outputs. Consequently, the extended tokenizer provides stability during generation, potentially leading to higher performance in downstream tasks. △ Less

Submitted 21 February, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

arXiv:2502.08690 [pdf, other]

Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

Authors: Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun

Abstract: Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operatio… ▽ More Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores. △ Less

Submitted 12 February, 2025; originally announced February 2025.

arXiv:2502.05802 [pdf, other]

doi 10.1016/j.eswa.2025.127822

Kalman Filter-Based Distributed Gaussian Process for Unknown Scalar Field Estimation in Wireless Sensor Networks

Authors: Jaemin Seo, Geunsik Bae, Hyondong Oh

Abstract: In this letter, we propose an online scalar field estimation algorithm of unknown environments using a distributed Gaussian process (DGP) framework in wireless sensor networks (WSNs). While the kernel-based Gaussian process (GP) has been widely employed for estimating unknown scalar fields, its centralized nature is not well-suited for handling a large amount of data from WSNs. To overcome the lim… ▽ More In this letter, we propose an online scalar field estimation algorithm of unknown environments using a distributed Gaussian process (DGP) framework in wireless sensor networks (WSNs). While the kernel-based Gaussian process (GP) has been widely employed for estimating unknown scalar fields, its centralized nature is not well-suited for handling a large amount of data from WSNs. To overcome the limitations of the kernel-based GP, recent advancements in GP research focus on approximating kernel functions as products of E-dimensional nonlinear basis functions, which can handle large WSNs more efficiently in a distributed manner. However, this approach requires a large number of basis functions for accurate approximation, leading to increased computational and communication complexities. To address these complexity issues, the paper proposes a distributed GP framework by incorporating a Kalman filter scheme (termed as K-DGP), which scales linearly with the number of nonlinear basis functions. Moreover, we propose a new consensus protocol designed to handle the unique data transmission requirement residing in the proposed K-DGP framework. This protocol preserves the inherent elements in the form of a certain column in the nonlinear function matrix of the communicated message; it enables wireless sensors to cooperatively estimate the environment and reach the global consensus through distributed learning with faster convergence than the widely-used average consensus protocol. Simulation results demonstrate rapid consensus convergence and outstanding estimation accuracy achieved by the proposed K-DGP algorithm. The scalability and efficiency of the proposed approach are further demonstrated by online dynamic environment estimation using WSNs. △ Less

Submitted 9 February, 2025; originally announced February 2025.

Journal ref: Expert Systems with Applications, vol. 285, p. 127822, 2025

arXiv:2502.05609 [pdf, other]

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Authors: Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon

Abstract: Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usuall… ▽ More Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures. △ Less

Submitted 8 February, 2025; originally announced February 2025.

Comments: Findings of NAACL 2025

arXiv:2502.05221 [pdf, ps, other]

Blackout DIFUSCO

Authors: Jun Pyo Seo

Abstract: This study explores the integration of Blackout Diffusion into the DIFUSCO framework for combinatorial optimization, specifically targeting the Traveling Salesman Problem (TSP). Inspired by the success of discrete-time diffusion models (D3PM) in maintaining structural integrity, we extend the paradigm to a continuous-time framework, leveraging the unique properties of Blackout Diffusion. Continuou… ▽ More This study explores the integration of Blackout Diffusion into the DIFUSCO framework for combinatorial optimization, specifically targeting the Traveling Salesman Problem (TSP). Inspired by the success of discrete-time diffusion models (D3PM) in maintaining structural integrity, we extend the paradigm to a continuous-time framework, leveraging the unique properties of Blackout Diffusion. Continuous-time modeling introduces smoother transitions and refined control, hypothesizing enhanced solution quality over traditional discrete methods. We propose three key improvements to enhance the diffusion process. First, we transition from a discrete-time-based model to a continuous-time framework, providing a more refined and flexible formulation. Second, we refine the observation time scheduling to ensure a smooth and linear transformation throughout the diffusion process, allowing for a more natural progression of states. Finally, building upon the second improvement, we further enhance the reverse process by introducing finer time slices in regions that are particularly challenging for the model, thereby improving accuracy and stability in the reconstruction phase. Although the experimental results did not exceed the baseline performance, they demonstrate the effectiveness of these methods in balancing simplicity and complexity, offering new insights into diffusion-based combinatorial optimization. This work represents the first application of Blackout Diffusion to combinatorial optimization, providing a foundation for further advancements in this domain. * The code is available for review at https://github.com/Giventicket/BlackoutDIFUSCO. △ Less

Submitted 5 June, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

Comments: 12 pages

arXiv:2502.04018 [pdf, other]

PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data

Authors: Keon Vin Park, Jisu Kim, Jaemin Seo

Abstract: This paper introduces PINT (Physics-Informed Neural Time Series Models), a framework that integrates physical constraints into neural time series models to improve their ability to capture complex dynamics. We apply PINT to the ERA5 WeatherBench dataset, focusing on long-term forecasting of 2m-temperature data. PINT incorporates the Simple Harmonic Oscillator Equation as a physics-informed prior,… ▽ More This paper introduces PINT (Physics-Informed Neural Time Series Models), a framework that integrates physical constraints into neural time series models to improve their ability to capture complex dynamics. We apply PINT to the ERA5 WeatherBench dataset, focusing on long-term forecasting of 2m-temperature data. PINT incorporates the Simple Harmonic Oscillator Equation as a physics-informed prior, embedding its periodic dynamics into RNN, LSTM, and GRU architectures. This equation's analytical solutions (sine and cosine functions) facilitate rigorous evaluation of the benefits of incorporating physics-informed constraints. By benchmarking against a linear regression baseline derived from its exact solutions, we quantify the impact of embedding physical principles in data-driven models. Unlike traditional time series models that rely on future observations, PINT is designed for practical forecasting. Using only the first 90 days of observed data, it iteratively predicts the next two years, addressing challenges posed by limited real-time updates. Experiments on the WeatherBench dataset demonstrate PINT's ability to generalize, capture periodic trends, and align with physical principles. This study highlights the potential of physics-informed neural models in bridging machine learning and interpretable climate applications. Our models and datasets are publicly available on GitHub: https://github.com/KV-Park. △ Less

Submitted 6 February, 2025; originally announced February 2025.

arXiv:2502.00182 [pdf, ps, other]

Understanding Federated Learning from IID to Non-IID dataset: An Experimental Study

Authors: Jungwon Seo, Ferhat Ozgur Catak, Chunming Rong

Abstract: As privacy concerns and data regulations grow, federated learning (FL) has emerged as a promising approach for training machine learning models across decentralized data sources without sharing raw data. However, a significant challenge in FL is that client data are often non-IID (non-independent and identically distributed), leading to reduced performance compared to centralized learning. While m… ▽ More As privacy concerns and data regulations grow, federated learning (FL) has emerged as a promising approach for training machine learning models across decentralized data sources without sharing raw data. However, a significant challenge in FL is that client data are often non-IID (non-independent and identically distributed), leading to reduced performance compared to centralized learning. While many methods have been proposed to address this issue, their underlying mechanisms are often viewed from different perspectives. Through a comprehensive investigation from gradient descent to FL, and from IID to non-IID data settings, we find that inconsistencies in client loss landscapes primarily cause performance degradation in non-IID scenarios. From this understanding, we observe that existing methods can be grouped into two main strategies: (i) adjusting parameter update paths and (ii) modifying client loss landscapes. These findings offer a clear perspective on addressing non-IID challenges in FL and help guide future research in the field. △ Less

Submitted 3 June, 2025; v1 submitted 31 January, 2025; originally announced February 2025.

Journal ref: 36th Norwegian ICT Conference for Research and Education, NIKT 2024

arXiv:2501.17799 [pdf, other]

doi 10.1145/3706598.3714213

Leveraging Multimodal LLM for Inspirational User Interface Search

Authors: Seokhyeon Park, Yumin Song, Soohyun Lee, Jaeyoung Kim, Jinwook Seo

Abstract: Inspirational search, the process of exploring designs to inform and inspire new creative work, is pivotal in mobile user interface (UI) design. However, exploring the vast space of UI references remains a challenge. Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps. Additionally, these models typically require metadata like view hierarchies, li… ▽ More Inspirational search, the process of exploring designs to inform and inspire new creative work, is pivotal in mobile user interface (UI) design. However, exploring the vast space of UI references remains a challenge. Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps. Additionally, these models typically require metadata like view hierarchies, limiting their practical use. We used a multimodal large language model (MLLM) to extract and interpret semantics from mobile UI images. We identified key UI semantics through a formative study and developed a semantic-based UI search system. Through computational and human evaluations, we demonstrate that our approach significantly outperforms existing UI retrieval methods, offering UI designers a more enriched and contextually relevant search experience. We enhance the understanding of mobile UI design semantics and highlight MLLMs' potential in inspirational search, providing a rich dataset of UI semantics for future studies. △ Less

Submitted 15 February, 2025; v1 submitted 29 January, 2025; originally announced January 2025.

Comments: In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '25)

arXiv:2501.16680 [pdf, ps, other]

Differentially Private Set Representations

Authors: Sarvar Patel, Giuseppe Persiano, Joon Young Seo, Kevin Yeo

Abstract: We study the problem of differentially private (DP) mechanisms for representing sets of size $k$ from a large universe. Our first construction creates $(ε,δ)$-DP representations with error probability of $1/(e^ε+ 1)$ using space at most $1.05 k ε\cdot \log(e)$ bits where the time to construct a representation is $O(k \log(1/δ))$ while decoding time is $O(\log(1/δ))$. We also present a second algor… ▽ More We study the problem of differentially private (DP) mechanisms for representing sets of size $k$ from a large universe. Our first construction creates $(ε,δ)$-DP representations with error probability of $1/(e^ε+ 1)$ using space at most $1.05 k ε\cdot \log(e)$ bits where the time to construct a representation is $O(k \log(1/δ))$ while decoding time is $O(\log(1/δ))$. We also present a second algorithm for pure $ε$-DP representations with the same error using space at most $k ε\cdot \log(e)$ bits, but requiring large decoding times. Our algorithms match our lower bounds on privacy-utility trade-offs (including constants but ignoring $δ$ factors) and we also present a new space lower bound matching our constructions up to small constant factors. To obtain our results, we design a new approach embedding sets into random linear systems deviating from most prior approaches that inject noise into non-private solutions. △ Less

Submitted 27 January, 2025; originally announced January 2025.

Comments: Appears at NeurIPS 2024

arXiv:2501.16510 [pdf, other]

Decrypting the temperature field in flow boiling with latent diffusion models

Authors: UngJin Na, JunYoung Seo, Taeil Kim, ByongGuk Jeon, HangJin Jo

Abstract: This paper presents an innovative method using Latent Diffusion Models (LDMs) to generate temperature fields from phase indicator maps. By leveraging the BubbleML dataset from numerical simulations, the LDM translates phase field data into corresponding temperature distributions through a two-stage training process involving a vector-quantized variational autoencoder (VQVAE) and a denoising autoen… ▽ More This paper presents an innovative method using Latent Diffusion Models (LDMs) to generate temperature fields from phase indicator maps. By leveraging the BubbleML dataset from numerical simulations, the LDM translates phase field data into corresponding temperature distributions through a two-stage training process involving a vector-quantized variational autoencoder (VQVAE) and a denoising autoencoder. The resulting model effectively reconstructs complex temperature fields at interfaces. Spectral analysis indicates a high degree of agreement with ground truth data in the low to mid wavenumber ranges, even though some inconsistencies are observed at higher wavenumbers, suggesting areas for further enhancement. This machine learning approach significantly reduces the computational burden of traditional simulations and improves the precision of experimental calibration methods. Future work will focus on refining the model's ability to represent small-scale turbulence and expanding its applicability to a broader range of boiling conditions. △ Less

Submitted 27 January, 2025; originally announced January 2025.

arXiv:2501.12668 [pdf, other]

NBDI: A Simple and Effective Termination Condition for Skill Extraction from Task-Agnostic Demonstrations

Authors: Myunsoo Kim, Hayeong Lee, Seong-Woong Shim, JunHo Seo, Byung-Jun Lee

Abstract: Intelligent agents are able to make decisions based on different levels of granularity and duration. Recent advances in skill learning enabled the agent to solve complex, long-horizon tasks by effectively guiding the agent in choosing appropriate skills. However, the practice of using fixed-length skills can easily result in skipping valuable decision points, which ultimately limits the potential… ▽ More Intelligent agents are able to make decisions based on different levels of granularity and duration. Recent advances in skill learning enabled the agent to solve complex, long-horizon tasks by effectively guiding the agent in choosing appropriate skills. However, the practice of using fixed-length skills can easily result in skipping valuable decision points, which ultimately limits the potential for further exploration and faster policy learning. In this work, we propose to learn a simple and effective termination condition that identifies decision points through a state-action novelty module that leverages agent experience data. Our approach, Novelty-based Decision Point Identification (NBDI), outperforms previous baselines in complex, long-horizon tasks, and remains effective even in the presence of significant variations in the environment configurations of downstream tasks, highlighting the importance of decision point identification in skill learning. △ Less

Submitted 20 May, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

arXiv:2501.10168 [pdf, other]

doi 10.1145/3706598.3713551

Unveiling High-dimensional Backstage: A Survey for Reliable Visual Analytics with Dimensionality Reduction

Authors: Hyeon Jeon, Hyunwook Lee, Yun-Hsin Kuo, Taehyun Yang, Daniel Archambault, Sungahn Ko, Takanori Fujiwara, Kwan-Liu Ma, Jinwook Seo

Abstract: Dimensionality reduction (DR) techniques are essential for visually analyzing high-dimensional data. However, visual analytics using DR often face unreliability, stemming from factors such as inherent distortions in DR projections. This unreliability can lead to analytic insights that misrepresent the underlying data, potentially resulting in misguided decisions. To tackle these reliability challe… ▽ More Dimensionality reduction (DR) techniques are essential for visually analyzing high-dimensional data. However, visual analytics using DR often face unreliability, stemming from factors such as inherent distortions in DR projections. This unreliability can lead to analytic insights that misrepresent the underlying data, potentially resulting in misguided decisions. To tackle these reliability challenges, we review 133 papers that address the unreliability of visual analytics using DR. Through this review, we contribute (1) a workflow model that describes the interaction between analysts and machines in visual analytics using DR, and (2) a taxonomy that identifies where and why reliability issues arise within the workflow, along with existing solutions for addressing them. Our review reveals ongoing challenges in the field, whose significance and urgency are validated by five expert researchers. This review also finds that the current research landscape is skewed toward developing new DR techniques rather than their interpretation or evaluation, where we discuss how the HCI community can contribute to broadening this focus. △ Less

Submitted 3 March, 2025; v1 submitted 17 January, 2025; originally announced January 2025.

Comments: In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '25)

arXiv:2501.09954 [pdf, other]

AIRCHITECT v2: Learning the Hardware Accelerator Design Space through Unified Representations

Authors: Jamin Seo, Akshat Ramachandran, Yu-Chuan Chuang, Anirudh Itagi, Tushar Krishna

Abstract: Design space exploration (DSE) plays a crucial role in enabling custom hardware architectures, particularly for emerging applications like AI, where optimized and specialized designs are essential. With the growing complexity of deep neural networks (DNNs) and the introduction of advanced foundational models (FMs), the design space for DNN accelerators is expanding at an exponential rate. Addition… ▽ More Design space exploration (DSE) plays a crucial role in enabling custom hardware architectures, particularly for emerging applications like AI, where optimized and specialized designs are essential. With the growing complexity of deep neural networks (DNNs) and the introduction of advanced foundational models (FMs), the design space for DNN accelerators is expanding at an exponential rate. Additionally, this space is highly non-uniform and non-convex, making it increasingly difficult to navigate and optimize. Traditional DSE techniques rely on search-based methods, which involve iterative sampling of the design space to find the optimal solution. However, this process is both time-consuming and often fails to converge to the global optima for such design spaces. Recently, AIrchitect v1, the first attempt to address the limitations of search-based techniques, transformed DSE into a constant-time classification problem using recommendation networks. In this work, we propose AIrchitect v2, a more accurate and generalizable learning-based DSE technique applicable to large-scale design spaces that overcomes the shortcomings of earlier approaches. Specifically, we devise an encoder-decoder transformer model that (a) encodes the complex design space into a uniform intermediate representation using contrastive learning and (b) leverages a novel unified representation blending the advantages of classification and regression to effectively explore the large DSE space without sacrificing accuracy. Experimental results evaluated on 10^5 real DNN workloads demonstrate that, on average, AIrchitect v2 outperforms existing techniques by 15% in identifying optimal design points. Furthermore, to demonstrate the generalizability of our method, we evaluate performance on unseen model workloads (LLMs) and attain a 1.7x improvement in inference latency on the identified hardware architecture. △ Less

Submitted 16 January, 2025; originally announced January 2025.

Comments: Accepted to DATE 2025

arXiv:2501.05768 [pdf, other]

Halal or Not: Knowledge Graph Completion for Predicting Cultural Appropriateness of Daily Products

Authors: Van Thuy Hoang, Tien-Bach-Thanh Do, Jinho Seo, Seung Charlie Kim, Luong Vuong Nguyen, Duong Nguyen Minh Huy, Hyeon-Ju Jeon, O-Joun Lee

Abstract: The growing demand for halal cosmetic products has exposed significant challenges, especially in Muslim-majority countries. Recently, various machine learning-based strategies, e.g., image-based methods, have shown remarkable success in predicting the halal status of cosmetics. However, these methods mainly focus on analyzing the discrete and specific ingredients within separate cosmetics, which i… ▽ More The growing demand for halal cosmetic products has exposed significant challenges, especially in Muslim-majority countries. Recently, various machine learning-based strategies, e.g., image-based methods, have shown remarkable success in predicting the halal status of cosmetics. However, these methods mainly focus on analyzing the discrete and specific ingredients within separate cosmetics, which ignore the high-order and complex relations between cosmetics and ingredients. To address this problem, we propose a halal cosmetic recommendation framework, namely HaCKG, that leverages a knowledge graph of cosmetics and their ingredients to explicitly model and capture the relationships between cosmetics and their components. By representing cosmetics and ingredients as entities within the knowledge graph, HaCKG effectively learns the high-order and complex relations between entities, offering a robust method for predicting halal status. Specifically, we first construct a cosmetic knowledge graph representing the relations between various cosmetics, ingredients, and their properties. We then propose a pre-trained relational graph attention network model with residual connections to learn the structural relation between entities in the knowledge graph. The pre-trained model is then fine-tuned on downstream cosmetic data to predict halal status. Extensive experiments on the cosmetic dataset over halal prediction tasks demonstrate the superiority of our model over state-of-the-art baselines. △ Less

Submitted 10 January, 2025; originally announced January 2025.

Comments: 10 pages

arXiv:2501.01404 [pdf, other]

doi 10.1145/3663548.3688487

StereoMath: An Accessible and Musical Equation Editor

Authors: Kenneth Ge, JooYoung Seo

Abstract: For blind and low-vision (BLV) individuals, digital math communication is uniquely difficult due to the lack of accessible tools. Currently, the state of the art is either code-based, like LaTeX, or WYSIWYG, like visual editors. However, both paradigms view math communication as primarily a visual typesetting problem, and may be accessible but difficult to use. In this paper, we present an equatio… ▽ More For blind and low-vision (BLV) individuals, digital math communication is uniquely difficult due to the lack of accessible tools. Currently, the state of the art is either code-based, like LaTeX, or WYSIWYG, like visual editors. However, both paradigms view math communication as primarily a visual typesetting problem, and may be accessible but difficult to use. In this paper, we present an equation editor that is built from the ground up with BLV accessibility in mind. Specifically, we notice that two of the biggest barriers with current technology are the high cognitive load and the lack of spatial relationships. Thus, we build an editor that uses spatial audio cues, muscle memory, tones, and more intuitive navigation to properly contextualize math equations. We discuss how this new paradigm can enable new levels of math communication, engagement, and literacy. Finally, we discuss natural next steps. △ Less

Submitted 2 January, 2025; originally announced January 2025.

Comments: 5 pages, 2 figures, accepted as demo paper at ASSETS '24

Journal ref: ASSETS 2024: Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, Article No.: 129, Pages 1 - 5

arXiv:2412.19450 [pdf, other]

Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models

Authors: Hyeonseok Moon, Jaehyung Seo, Seungyoon Lee, Chanjun Park, Heuiseok Lim

Abstract: One of the key strengths of Large Language Models (LLMs) is their ability to interact with humans by generating appropriate responses to given instructions. This ability, known as instruction-following capability, has established a foundation for the use of LLMs across various fields and serves as a crucial metric for evaluating their performance. While numerous evaluation benchmarks have been dev… ▽ More One of the key strengths of Large Language Models (LLMs) is their ability to interact with humans by generating appropriate responses to given instructions. This ability, known as instruction-following capability, has established a foundation for the use of LLMs across various fields and serves as a crucial metric for evaluating their performance. While numerous evaluation benchmarks have been developed, most focus solely on clear and coherent instructions. However, we have noted that LLMs can become easily distracted by instruction-formatted statements, which may lead to an oversight of their instruction comprehension skills. To address this issue, we introduce the Intention of Instruction (IoInst) benchmark. This benchmark evaluates LLMs' capacity to remain focused and understand instructions without being misled by extraneous instructions. The primary objective of this benchmark is to identify the appropriate instruction that accurately guides the generation of a given context. Our findings suggest that even recently introduced state-of-the-art models still lack instruction understanding capability. Along with the proposition of IoInst in this study, we also present broad analyses of the several strategies potentially applicable to IoInst. △ Less

Submitted 22 January, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

Comments: NAACL25-Findings

arXiv:2412.11520 [pdf, other]

EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting

Authors: Dong In Lee, Hyeongcheol Park, Jiyoung Seo, Eunbyung Park, Hyunje Park, Ha Dam Baek, Sangheon Shin, Sangmin Kim, Sangpil Kim

Abstract: Recent advancements in 3D editing have highlighted the potential of text-driven methods in real-time, user-friendly AR/VR applications. However, current methods rely on 2D diffusion models without adequately considering multi-view information, resulting in multi-view inconsistency. While 3D Gaussian Splatting (3DGS) significantly improves rendering quality and speed, its 3D editing process encount… ▽ More Recent advancements in 3D editing have highlighted the potential of text-driven methods in real-time, user-friendly AR/VR applications. However, current methods rely on 2D diffusion models without adequately considering multi-view information, resulting in multi-view inconsistency. While 3D Gaussian Splatting (3DGS) significantly improves rendering quality and speed, its 3D editing process encounters difficulties with inefficient optimization, as pre-trained Gaussians retain excessive source information, hindering optimization. To address these limitations, we propose EditSplat, a novel text-driven 3D scene editing framework that integrates Multi-view Fusion Guidance (MFG) and Attention-Guided Trimming (AGT). Our MFG ensures multi-view consistency by incorporating essential multi-view information into the diffusion process, leveraging classifier-free guidance from the text-to-image diffusion model and the geometric structure inherent to 3DGS. Additionally, our AGT utilizes the explicit representation of 3DGS to selectively prune and optimize 3D Gaussians, enhancing optimization efficiency and enabling precise, semantically rich local editing. Through extensive qualitative and quantitative evaluations, EditSplat achieves state-of-the-art performance, establishing a new benchmark for text-driven 3D scene editing. △ Less

Submitted 17 April, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

arXiv:2412.09916 [pdf, other]

ProxyLLM : LLM-Driven Framework for Customer Support Through Text-Style Transfer

Authors: Sehyeong Jo, Jungwon Seo

Abstract: Chatbot-based customer support services have significantly advanced with the introduction of large language models (LLMs), enabling enhanced response quality and broader application across industries. However, while these advancements focus on reducing business costs and improving customer satisfaction, limited attention has been given to the experiences of customer service agents, who are critica… ▽ More Chatbot-based customer support services have significantly advanced with the introduction of large language models (LLMs), enabling enhanced response quality and broader application across industries. However, while these advancements focus on reducing business costs and improving customer satisfaction, limited attention has been given to the experiences of customer service agents, who are critical to the service ecosystem. A major challenge faced by agents is the stress caused by unnecessary emotional exhaustion from harmful texts, which not only impairs their efficiency but also negatively affects customer satisfaction and business outcomes. In this work, we propose an LLM-powered system designed to enhance the working conditions of customer service agents by addressing emotionally intensive communications. Our proposed system leverages LLMs to transform the tone of customer messages, preserving actionable content while mitigating the emotional impact on human agents. Furthermore, the application is implemented as a Chrome extension, making it highly adaptable and easy to integrate into existing systems. Our method aims to enhance the overall service experience for businesses, customers, and agents. △ Less

Submitted 17 December, 2024; v1 submitted 13 December, 2024; originally announced December 2024.

arXiv:2411.17338 [pdf, other]

Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach

Authors: Changgeon Ko, Jisu Shin, Hoyun Song, Jeongyeon Seo, Jong C. Park

Abstract: Large language models (LLMs) often reflect real-world biases, leading to efforts to mitigate these effects and make the models unbiased. Achieving this goal requires defining clear criteria for an unbiased state, with any deviation from these criteria considered biased. Some studies define an unbiased state as equal treatment across diverse demographic groups, aiming for balanced outputs from LLMs… ▽ More Large language models (LLMs) often reflect real-world biases, leading to efforts to mitigate these effects and make the models unbiased. Achieving this goal requires defining clear criteria for an unbiased state, with any deviation from these criteria considered biased. Some studies define an unbiased state as equal treatment across diverse demographic groups, aiming for balanced outputs from LLMs. However, differing perspectives on equality and the importance of pluralism make it challenging to establish a universal standard. Alternatively, other approaches propose using fact-based criteria for more consistent and objective evaluations, though these methods have not yet been fully applied to LLM bias assessments. Thus, there is a need for a metric with objective criteria that offers a distinct perspective from equality-based approaches. Motivated by this need, we introduce a novel metric to assess bias using fact-based criteria and real-world statistics. In this paper, we conducted a human survey demonstrating that humans tend to perceive LLM outputs more positively when they align closely with real-world demographic distributions. Evaluating various LLMs with our proposed metric reveals that model bias varies depending on the criteria used, highlighting the need for multi-perspective assessment. △ Less

Submitted 26 November, 2024; originally announced November 2024.

Comments: Accepted in NeurIPS 2024 Workshop on Socially Responsible Language Modelling Research (SoLaR)

arXiv:2411.09255 [pdf, other]

DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine

Authors: Jean Seo, Jongwon Lim, Dongjun Jang, Hyopil Shin

Abstract: We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing… ▽ More We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these responses is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucinations compared to previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels, being able to be expanded to other specialized domains. We release the dataset and code in public. △ Less

Submitted 14 November, 2024; originally announced November 2024.

Comments: EMNLP2024/FEVER

Showing 1–50 of 277 results for author: Seo, J