Showing 1–50 of 88 results for author: Chae, Y

  1. arXiv:2509.15738  [pdf, ps, other]

    cs.LG

    GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning

    Authors: Musen Lin, Minghao Liu, Taoran Lu, Lichen Yuan, Yiwei Liu, Haonan Xu, Yu Miao, Yuhao Chao, Zhaojian Li

    Abstract: Graphical User Interface (GUI) Agents, powered by large language and vision-language models, hold promise for enabling end-to-end automation in digital environments. However, their progress is fundamentally constrained by the scarcity of scalable, high-quality trajectory data. Existing data collection strategies either rely on costly and inconsistent manual annotations or on synthetic generation m…

    Submitted 19 September, 2025; originally announced September 2025.

  2. arXiv:2509.14589  [pdf, ps, other]

    cs.CR cs.AI

    ATLANTIS: AI-driven Threat Localization, Analysis, and Triage Intelligence System

    Authors: Taesoo Kim, HyungSeok Han, Soyeon Park, Dae R. Jeong, Dohyeok Kim, Dongkwan Kim, Eunsoo Kim, Jiho Kim, Joshua Wang, Kangsu Kim, Sangwoo Ji, Woosun Song, Hanqing Zhao, Andrew Chin, Gyejin Lee, Kevin Stevens, Mansour Alharthi, Yizhuo Zhai, Cen Zhang, Joonun Jang, Yeongjin Jang, Ammar Askar, Dongju Kim, Fabian Fleischer, Jeongin Cho, et al. (21 additional authors not shown)

    Abstract: We present ATLANTIS, the cyber reasoning system developed by Team Atlanta that won 1st place in the Final Competition of DARPA's AI Cyber Challenge (AIxCC) at DEF CON 33 (August 2025). AIxCC (2023-2025) challenged teams to build autonomous cyber reasoning systems capable of discovering and patching vulnerabilities at the speed and scale of modern software. ATLANTIS integrates large language models…

    Submitted 17 September, 2025; originally announced September 2025.

    Comments: Version 1.0 (September 17, 2025). Technical Report. Team Atlanta -- 1st place in DARPA AIxCC Final Competition. Project page: https://team-atlanta.github.io/

  3. arXiv:2509.10105  [pdf, ps, other]

    cs.CV cs.CL

    VARCO-VISION-2.0 Technical Report

    Authors: Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim

    Abstract: We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stag…

    Submitted 15 September, 2025; v1 submitted 12 September, 2025; originally announced September 2025.

    Comments: 19 pages, 1 figure, 14 tables. Technical report for VARCO-VISION-2.0, a Korean-English bilingual VLM in 14B and 1.7B variants. Key features: multi-image understanding, OCR with text localization, improved Korean capabilities

  4. arXiv:2509.09671  [pdf, ps, other]

    cs.RO cs.CV

    Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration

    Authors: Sirui Xu, Yu-Wei Chao, Liuyu Bian, Arsalan Mousavian, Yu-Xiong Wang, Liang-Yan Gui, Wei Yang

    Abstract: Hand-object motion-capture (MoCap) repositories offer large-scale, contact-rich demonstrations and hold promise for scaling dexterous robotic manipulation. Yet demonstration inaccuracies and embodiment gaps between human and robot hands limit the straightforward use of these data. Existing methods adopt a three-stage workflow, including retargeting, tracking, and residual correction, which often l…

    Submitted 11 September, 2025; originally announced September 2025.

    Comments: CoRL 2025

  5. arXiv:2509.02317  [pdf, ps, other]

    cs.NI

    AI Agent Communication from Internet Architecture Perspective: Challenges and Opportunities

    Authors: Chenguang Du, Chuyi Wang, Yihan Chao, Xiaohui Xie, Yong Cui

    Abstract: The rapid development of AI agents leads to a surge in communication demands. Alongside this rise, a variety of frameworks and protocols emerge. While these efforts demonstrate the vitality of the field, they also highlight increasing fragmentation, with redundant innovation and siloed designs hindering cross-domain interoperability. These challenges underscore the need for a systematic perspectiv…

    Submitted 2 September, 2025; originally announced September 2025.

    Comments: Work in Progress

  6. arXiv:2508.06663  [pdf, ps, other]

    cs.LG

    Transferring Social Network Knowledge from Multiple GNN Teachers to Kolmogorov-Arnold Networks

    Authors: Yuan-Hung Chao, Chia-Hsun Lu, Chih-Ya Shen

    Abstract: Graph Neural Networks (GNNs) have shown strong performance on graph-structured data, but their reliance on graph connectivity often limits scalability and efficiency. Kolmogorov-Arnold Networks (KANs), a recent architecture with learnable univariate functions, offer strong nonlinear expressiveness and efficient inference. In this work, we integrate KANs into three popular GNN architectures-GAT, SG…

    Submitted 8 August, 2025; originally announced August 2025.

    Comments: 6 pages, 3 tables

  7. arXiv:2507.19040  [pdf, ps, other]

    eess.AS cs.CL

    FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

    Authors: Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng

    Abstract: Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizin…

    Submitted 25 July, 2025; originally announced July 2025.

    Comments: Accepted to Interspeech 2025. 5 pages

  8. arXiv:2507.13097  [pdf, ps, other]

    cs.RO cs.AI

    GraspGen: A Diffusion-based Framework for 6-DOF Grasping with On-Generator Training

    Authors: Adithyavairavan Murali, Balakumar Sundaralingam, Yu-Wei Chao, Wentao Yuan, Jun Yamada, Mark Carlson, Fabio Ramos, Stan Birchfield, Dieter Fox, Clemens Eppner

    Abstract: Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success on modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of…

    Submitted 17 July, 2025; originally announced July 2025.

  9. arXiv:2507.11287  [pdf, ps, other]

    cs.CV cs.RO

    Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

    Authors: An-Lun Liu, Yu-Wei Chao, Yi-Ting Chen

    Abstract: In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical…

    Submitted 15 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  10. arXiv:2506.16853  [pdf, ps, other]

    cs.LG

    Reward-Agnostic Prompt Optimization for Text-to-Image Diffusion Models

    Authors: Semin Kim, Yeonwoo Cha, Jaehoon Yoo, Seunghoon Hong

    Abstract: We investigate a general approach for improving user prompts in text-to-image (T2I) diffusion models by finding prompts that maximize a reward function specified at test-time. Although diverse reward models are used for evaluating image generation, existing automated prompt engineering methods typically target specific reward configurations. Consequently, these specialized designs exhibit suboptim…

    Submitted 29 September, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

    Comments: 29 pages, Under review

  11. arXiv:2506.16538  [pdf, ps, other]

    cs.SD eess.AS

    Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ

    Authors: Yunkee Chae, Kyogu Lee

    Abstract: Residual Vector Quantization (RVQ) has become a dominant approach in neural speech and audio coding, providing high-fidelity compression. However, speech coding presents additional challenges due to real-world noise, which degrades compression efficiency. Standard codecs allocate bits uniformly, wasting bitrate on noise components that do not contribute to intelligibility. This paper introduces a…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  12. arXiv:2506.13339  [pdf, ps, other]

    cs.CL eess.AS

    NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

    Authors: Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng

    Abstract: This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, languag…

    Submitted 4 July, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted by Interspeech 2025 MLC-SLM challenge (5th place). System report

  13. arXiv:2505.23305  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

    Authors: Yunkee Chae, Kyogu Lee

    Abstract: We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: 27 pages, 4 figures

  14. arXiv:2505.13032  [pdf, other]

    cs.SD cs.CL cs.MM eess.AS

    MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

    Authors: Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, et al. (9 additional authors not shown)

    Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Open-source at https://github.com/ddlBoJack/MMAR

  15. arXiv:2505.07235  [pdf, other]

    cs.SD eess.AS

    Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding

    Authors: Dianwen Ng, Kun Zhou, Yi-Wen Chao, Zhiwei Xiong, Bin Ma, Eng Siong Chng

    Abstract: Achieving high-fidelity audio compression while preserving perceptual quality across diverse content remains a key challenge in Neural Audio Coding (NAC). We introduce MUFFIN, a fully convolutional Neural Psychoacoustic Coding (NPC) framework that leverages psychoacoustically guided multi-band frequency reconstruction. At its core is a Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) mod…

    Submitted 12 May, 2025; originally announced May 2025.

  16. arXiv:2504.11914  [pdf, other]

    cs.CV

    AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection

    Authors: Yuhao Chao, Jie Liu, Jie Tang, Gangshan Wu

    Abstract: Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. W…

    Submitted 16 April, 2025; originally announced April 2025.

  17. arXiv:2504.01959  [pdf, other]

    cs.RO cs.CV

    Slot-Level Robotic Placement via Visual Imitation from Single Human Video

    Authors: Dandan Shan, Kaichun Mo, Wei Yang, Yu-Wei Chao, David Fouhey, Dieter Fox, Arsalan Mousavian

    Abstract: The majority of modern robot learning methods focus on learning a set of pre-defined tasks with limited or no generalization to new tasks. Extending the robot skillset to novel tasks involves gathering an extensive amount of training data for additional tasks. In this paper, we address the problem of teaching new tasks to robots using human demonstration videos for repetitive tasks (e.g., packing)…

    Submitted 2 April, 2025; originally announced April 2025.

  18. arXiv:2502.17842  [pdf, other]

    cs.LG cs.NI

    Task-Driven Semantic Quantization and Imitation Learning for Goal-Oriented Communications

    Authors: Yu-Chieh Chao, Yubei Chen, Weiwei Wang, Achintha Wijesinghe, Suchinthaka Wanninayaka, Songyang Zhang, Zhi Ding

    Abstract: Semantic communication marks a new paradigm shift from bit-wise data transmission to semantic information delivery for the purpose of bandwidth reduction. To more effectively carry out specialized downstream tasks at the receiver end, it is crucial to define the most critical semantic message in the data based on the task or goal-oriented features. In this work, we propose a novel goal-oriented co…

    Submitted 27 February, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: Accepted for publication in 2025 International Conference on Communications (IEEE ICC); 6 pages, 4 figures

    Journal ref: 2025 International Conference on Communications (IEEE ICC)

  19. arXiv:2502.13716  [pdf, other]

    cs.CV

    Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields

    Authors: Taewoo Kim, Yujeong Chae, Hyun-Kurl Jang, Kuk-Jin Yoon

    Abstract: Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since the event cameras are bio-inspired sensors that only encode brightness changes with a micro-second temporal resolution, several works utilized the event camera to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only event…

    Submitted 19 February, 2025; originally announced February 2025.

    Comments: Accepted in CVPR 2023 (Highlight)

  20. arXiv:2502.05498  [pdf, other]

    cs.LG cs.AI cs.GT cs.MA

    Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations

    Authors: Larkin Liu, Kashif Rasul, Yutong Chao, Jalal Etesami

    Abstract: We present a novel framework for online learning in Stackelberg general-sum games, where two agents, the leader and follower, engage in sequential turn-based interactions. At the core of this approach is a learned diffeomorphism that maps the joint action space to a smooth Riemannian manifold, referred to as the Stackelberg manifold. This mapping, facilitated by neural normalizing flows, ensures t…

    Submitted 8 February, 2025; originally announced February 2025.

    Comments: Stackelberg games. Manifold learning. Online learning

    MSC Class: 91A10 ACM Class: I.2.6; I.2.11

  21. arXiv:2501.19380  [pdf, ps, other]

    cs.SE

    Creative Problem-Solving: A Study with Blind and Low Vision Software Professionals

    Authors: Karina Kohl, Yoonha Cha, Victoria Jackson, Stacy Branham, André van der Hoek, Rafael Prikladnicki

    Abstract: Background: Software engineering requires both technical skills and creative problem-solving. Blind and low-vision software professionals (BLVSPs) encounter numerous workplace challenges, including inaccessible tools and collaboration hurdles with sighted colleagues. Objective: This study explores the innovative strategies employed by BLVSPs to overcome these accessibility barriers, focusing on th…

    Submitted 31 January, 2025; originally announced January 2025.

    Comments: Pre-print of accepted paper for CHASE 2025 (18th International Conference on Cooperative and Human Aspects of Software Engineering)

  22. The Dilemma of Building Do-It-Yourself (DIY) Solutions for Workplace Accessibility

    Authors: Yoonha Cha, Victoria Jackson, Karina Kohl, Rafael Prikladnicki, André van der Hoek, Stacy M. Branham

    Abstract: Existing commercial and in-house software development tools are often inaccessible to Blind and Low Vision Software Professionals (BLVSPs), hindering their participation and career growth at work. Building on existing research on Do-It-Yourself (DIY) Assistive Technologies and customized tools made by programmers, we shed light on the currently unexplored intersection of how DIY tools built and us…

    Submitted 30 January, 2025; originally announced January 2025.

    Comments: 17 pages, accepted to CHI 2025

    ACM Class: D.2.9; H.5.3

  23. arXiv:2412.17839  [pdf, other]

    cs.LG cs.AI eess.IV

    LaMI-GO: Latent Mixture Integration for Goal-Oriented Communications Achieving High Spectrum Efficiency

    Authors: Achintha Wijesinghe, Suchinthaka Wanninayaka, Weiwei Wang, Yu-Chieh Chao, Songyang Zhang, Zhi Ding

    Abstract: The recent rise of semantic-style communications includes the development of goal-oriented communications (GOCOMs) remarkably efficient multimedia information transmissions. The concept of GO-COMS leverages advanced artificial intelligence (AI) tools to address the rising demand for bandwidth efficiency in applications, such as edge computing and Internet-of-Things (IoT). Unlike traditional commun…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: Under review

  24. arXiv:2412.16720  [pdf, other]

    cs.AI

    OpenAI o1 System Card

    Authors: OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, et al. (238 additional authors not shown)

    Abstract: The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-ar…

    Submitted 21 December, 2024; originally announced December 2024.

  25. arXiv:2411.16627  [pdf, other]

    cs.RO cs.AI cs.HC cs.LG

    Inference-Time Policy Steering through Human Interactions

    Authors: Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sundaralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez-D'Arpino, Dieter Fox, Julie Shah

    Abstract: Generative policies trained with human demonstrations can autonomously accomplish multimodal, long-horizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, l…

    Submitted 25 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: ICRA 2025

  26. arXiv:2411.13100  [pdf, ps, other]

    cs.CL cs.AI

    Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control

    Authors: Yunkee Chae, Eunsik Shin, Suntae Hwang, Seungryeol Paik, Kyogu Lee

    Abstract: Lyrics generation presents unique challenges, particularly in achieving precise syllable control while adhering to song form structures such as verses and choruses. Conventional line-by-line approaches often lead to unnatural phrasing, underscoring the need for more granular syllable management. We propose a framework for lyrics generation that enables multi-level syllable control at the word, phr…

    Submitted 23 June, 2025; v1 submitted 20 November, 2024; originally announced November 2024.

    Comments: Accepted to Interspeech 2025

  27. arXiv:2410.22918  [pdf, other]

    cs.LG

    Simulation-Free Training of Neural ODEs on Paired Data

    Authors: Semin Kim, Jaehoon Yoo, Jinwoo Kim, Yeonwoo Cha, Saehoon Kim, Seunghoon Hong

    Abstract: In this work, we investigate a method for simulation-free training of Neural Ordinary Differential Equations (NODEs) for learning deterministic mappings between paired data. Despite the analogy of NODEs as continuous-depth residual networks, their application in typical supervised learning tasks has not been popular, mainly due to the large number of function evaluations required by ODE solvers an…

    Submitted 30 October, 2024; originally announced October 2024.

  28. arXiv:2410.21526  [pdf, other]

    cs.LG cs.CL

    Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

    Authors: Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

    Abstract: Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can bring deficient outcomes while applying the trained model to applications. Therefore, we propos…

    Submitted 22 March, 2025; v1 submitted 28 October, 2024; originally announced October 2024.

    Comments: ICLR 2025 camera ready

  29. arXiv:2410.21153  [pdf, other]

    cs.CV cs.RO

    Synthetica: Large Scale Synthetic Data for Robot Perception

    Authors: Ritvik Singh, Jingzhou Liu, Karl Van Wyk, Yu-Wei Chao, Jean-Francois Lafleche, Florian Shkurti, Nathan Ratliff, Ankur Handa

    Abstract: Vision-based object detectors are a crucial basis for robotics applications as they provide valuable information about object localisation in the environment. These need to ensure high reliability in different lighting conditions, occlusions, and visual artifacts, all while running in real-time. Collecting and annotating real-world data for these networks is prohibitively time consuming and costly…

    Submitted 28 October, 2024; originally announced October 2024.

    Comments: 21 pages, 11 figures, 5 tables

  30. arXiv:2410.11758  [pdf, other]

    cs.RO cs.CL cs.CV cs.LG

    Latent Action Pretraining from Videos

    Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo

    Abstract: We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a…

    Submitted 15 May, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: ICLR 2025 Website: https://latentactionpretraining.github.io

  31. arXiv:2410.09342  [pdf, other]

    cs.CL

    LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models

    Authors: Zihan Zhou, Chong Li, Xinyi Chen, Shuo Wang, Yu Chao, Zhili Li, Haoyu Wang, Rongqiao An, Qi Shi, Zhixing Tan, Xu Han, Xiaodong Shi, Zhiyuan Liu, Maosong Sun

    Abstract: Enlarging the context window of large language models (LLMs) has become a crucial research area, particularly for applications involving extremely long texts. In this work, we propose a novel training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding. The proposed LLM$\times$MapReduce framework splits the entire docume…

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: Work in Progress. Code: https://github.com/thunlp/LLMxMapReduce

  32. arXiv:2410.06016  [pdf, other]

    cs.SD cs.LG eess.AS

    Variable Bitrate Residual Vector Quantization for Audio Coding

    Authors: Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for…

    Submitted 27 April, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: ICASSP 2025 camera ready version

  33. Intersecting Liminality: Acquiring a Smartphone as a Blind or Low Vision Older Adult

    Authors: Isabela Figueira, Yoonha Cha, Stacy M. Branham

    Abstract: Older adults are increasingly acquiring smartphones. But acquiring smartphones can be difficult, and little is known about the particular challenges of older adults who are additionally blind or losing their vision. We shed light on the social and technical aspects of acquiring smartphones with vision loss, based on deep qualitative interviews with 22 blind or low vision (BLV) older adults aged 60…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: 14 pages, 2 figures, 2 tables, conference paper accepted to The 26th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2024)

    ACM Class: H.5.0; K.4.2

  34. arXiv:2406.08545  [pdf, other]

    cs.RO cs.AI cs.CV

    RVT-2: Learning Precise Manipulation from Few Demonstrations

    Authors: Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, Dieter Fox

    Abstract: In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks with few demonstrations and solving them precisely. Prior works, like PerAct and RVT, have studied this problem, however, they often struggle with tasks requiring high…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to RSS 2024

  35. arXiv:2406.06843  [pdf, other]

    cs.CV

    HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

    Authors: Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang

    Abstract: We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGBD cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, sign…

    Submitted 11 March, 2025; v1 submitted 10 June, 2024; originally announced June 2024.

  36. Understanding the Career Mobility of Blind and Low Vision Software Professionals

    Authors: Yoonha Cha, Victoria Jackson, Isabela Figueira, Stacy M. Branham, André van der Hoek

    Abstract: Context: Scholars in the software engineering (SE) research community have investigated career advancement in the software industry. Research topics have included how individual and external factors can impact career mobility of software professionals, and how gender affects career advancement. However, the community has yet to look at career mobility from the lens of accessibility. Specifically,…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 12 pages, 1 table, conference paper, 2024 ACM / IEEE 17th International Conference on Cooperative and Human Aspects of Software Engineering

    ACM Class: D.2.9; H.5.3

  37. arXiv:2404.01842  [pdf, other]

    cs.CV

    Semi-Supervised Domain Adaptation for Wildfire Detection

    Authors: JooYoung Jang, Youngseo Cha, Jisu Kim, SooHyung Lee, Geonu Lee, Minkook Cho, Young Hwang, Nojun Kwak

    Abstract: Recently, both the frequency and intensity of wildfires have increased worldwide, primarily due to climate change. In this paper, we propose a novel protocol for wildfire detection, leveraging semi-supervised Domain Adaptation for object detection, accompanied by a corresponding dataset designed for use by both academics and industries. Our dataset encompasses 30 times more diverse labeled scenes…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 16 pages, 5 figures, 22 tables

  38. arXiv:2403.06497  [pdf, other]

    cs.CV cs.MM

    QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning

    Authors: Jiun-Man Chen, Yu-Hsuan Chao, Yu-Jie Wang, Ming-Der Shieh, Chih-Chung Hsu, Wei-Fen Lin

    Abstract: Transformer-based models have gained widespread popularity in both the computer vision (CV) and natural language processing (NLP) fields. However, significant challenges arise during post-training linear quantization, leading to noticeable reductions in inference accuracy. Our study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tunin…

    Submitted 11 March, 2024; originally announced March 2024.

  39. arXiv:2312.14401  [pdf, other]

    cs.HC

    Towards an Exploratory Visual Analytics System for Griefer Identification in MOBA Games

    Authors: Zixin Chen, Shiyi Liu, Zhihua Jin, Gaoping Huang, Yang Chao, Zhenchuan Yang, Quan Li, Huamin Qu

    Abstract: Multiplayer Online Battle Arenas (MOBAs) have gained a significant player base worldwide, generating over two billion US dollars in annual game revenue. However, the presence of griefers, who deliberately irritate and harass other players within the game, can have a detrimental impact on players' experience, compromising game fairness and potentially leading to the emergence of gray industries. Un…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: IEEE VIS 2023 (Poster)

  40. arXiv:2312.04936  [pdf, other]

    cs.RO

    SKT-Hang: Hanging Everyday Objects via Object-Agnostic Semantic Keypoint Trajectory Generation

    Authors: Chia-Liang Kuo, Yu-Wei Chao, Yi-Ting Chen

    Abstract: We study the problem of hanging a wide range of grasped objects on diverse supporting items. Hanging objects is a ubiquitous task that is encountered in numerous aspects of our everyday lives. However, both the objects and supporting items can exhibit substantial variations in their shapes and structures, bringing two challenging issues: (1) determining the task-relevant geometric structures acros…

    Submitted 8 December, 2023; originally announced December 2023.

  41. SynH2R: Synthesizing Hand-Object Motions for Learning Human-to-Robot Handovers

    Authors: Sammy Christen, Lan Feng, Wei Yang, Yu-Wei Chao, Otmar Hilliges, Jie Song

    Abstract: Vision-based human-to-robot handover is an important and challenging task in human-robot interaction. Recent work has attempted to train robot policies by interacting with dynamic virtual humans in simulated environments, where the policies can later be transferred to the real world. However, a major bottleneck is the reliance on human motion capture data, which is expensive to acquire and difficu…

    Submitted 31 December, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: Accepted to ICRA 2024. Project page: https://eth-ait.github.io/synthetic-handovers/

    Journal ref: IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 2024, pp. 3168-3175

  42. arXiv:2310.13969  [pdf, ps, other]

    stat.ML cs.LG

    Distributed Linear Regression with Compositional Covariates

    Authors: Yue Chao, Lei Huang, Xuejun Ma

    Abstract: With the availability of extraordinarily huge data sets, solving the problems of distributed statistical methodology and computing for such data sets has become increasingly crucial in the big data area. In this paper, we focus on the distributed sparse penalized linear log-contrast model in massive compositional data. In particular, two distributed optimization techniques under centralized and de…

    Submitted 21 October, 2023; originally announced October 2023.

    Comments: 35 pages, 2 figures

    MSC Class: 62-08

    ACM Class: G.3

  43. arXiv:2308.12599  [pdf, other]

    cs.SD cs.LG eess.AS

    Exploiting Time-Frequency Conformers for Music Audio Enhancement

    Authors: Yunkee Chae, Junghyun Koo, Sungho Lee, Kyogu Lee

    Abstract: With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the…

    Submitted 24 August, 2023; originally announced August 2023.

    Comments: Accepted by ACM Multimedia 2023

  44. arXiv:2308.11896  [pdf, other]

    cs.CV

    Age Prediction From Face Images Via Contrastive Learning

    Authors: Yeongnam Chae, Poulami Raha, Mijung Kim, Bjorn Stenger

    Abstract: This paper presents a novel approach for accurately estimating age from face images, which overcomes the challenge of collecting a large dataset of individuals with the same identity at different ages. Instead, we leverage readily available face datasets of different people at different ages and aim to extract age-related features using contrastive learning. Our method emphasizes these relevant fe…

    Submitted 22 August, 2023; originally announced August 2023.

    Comments: MVA2023

  45. arXiv:2308.09383  [pdf, other]

    cs.CV

    Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction from Events

    Authors: Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, Kuk-Jin Yoon

    Abstract: Recognizing objects from sparse and noisy events becomes extremely difficult when paired images and category labels do not exist. In this paper, we study label-free event-based object recognition where category labels and paired images are not available. To this end, we propose a joint formulation of object recognition and image reconstruction in a complementary manner. Our method first reconstruc…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted to ICCV 2023 (Oral)

  46. arXiv:2307.12576  [pdf, other]

    eess.AS cs.IR cs.LG cs.SD

    Self-refining of Pseudo Labels for Music Source Separation with Noisy Labeled Data

    Authors: Junghyun Koo, Yunkee Chae, Chang-Bin Jeon, Kyogu Lee

    Abstract: Music source separation (MSS) faces challenges due to the limited availability of correctly-labeled individual instrument tracks. With the push to acquire larger datasets to improve MSS performance, the inevitability of encountering mislabeled individual instrument tracks becomes a significant challenge to address. This paper introduces an automated technique for refining the labels in a partially…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: 24th International Society for Music Information Retrieval Conference (ISMIR 2023)

  47. arXiv:2307.09699  [pdf, other]

    cs.HC

    ActorLens: Visual Analytics for High-level Actor Identification in MOBA Games

    Authors: Zhihua Jin, Gaoping Huang, Zixin Chen, Shiyi Liu, Yang Chao, Zhenchuan Yang, Quan Li, Huamin Qu

    Abstract: Multiplayer Online Battle Arenas (MOBAs) have garnered a substantial player base worldwide. Nevertheless, the presence of noxious players, commonly referred to as "actors", can significantly compromise game fairness by exhibiting negative behaviors that diminish their team's competitive edge. Furthermore, high-level actors tend to engage in more egregious conduct to evade detection, thereby causin…

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: 15 pages, 9 figures

  48. arXiv:2307.04577  [pdf, other]

    cs.RO cs.CV cs.LG

    AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System

    Authors: Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, Dieter Fox

    Abstract: Vision-based teleoperation offers the possibility to endow robots with human-level intelligence to physically interact with the environment, while only requiring low-cost camera sensors. However, current vision-based teleoperation systems are designed and engineered towards a particular robot model and deploy environment, which scales poorly as the pool of the robot models expands and the variety…

    Submitted 16 May, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

    Comments: http://anyteleop.com/ Robotics: Science and Systems 2023

  49. arXiv:2307.03073  [pdf, other]

    cs.CV cs.RO

    Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

    Authors: Jishnu Jaykumar P, Kamalesh Palanisamy, Yu-Wei Chao, Xinya Du, Yu Xiang

    Abstract: We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP which utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot exampl…

    Submitted 14 July, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

  50. arXiv:2306.16495  [pdf]

    cs.SI cs.AI cs.IR

    Event Detection from Social Media Stream: Methods, Datasets and Opportunities

    Authors: Quanzhi Li, Yang Chao, Dong Li, Yao Lu, Chi Zhang

    Abstract: Social media streams contain a large and diverse amount of information, ranging from daily-life stories to the latest global and local events and news. Twitter, especially, allows a fast spread of events happening in real time, and enables individuals and organizations to stay informed of the events happening now. Event detection from social media data poses different challenges from traditional text a…

    Submitted 28 June, 2023; originally announced June 2023.

    Comments: 8 pages