Search | arXiv e-print repository

GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning

Authors: Musen Lin, Minghao Liu, Taoran Lu, Lichen Yuan, Yiwei Liu, Haonan Xu, Yu Miao, Yuhao Chao, Zhaojian Li

Abstract: Graphical User Interface (GUI) Agents, powered by large language and vision-language models, hold promise for enabling end-to-end automation in digital environments. However, their progress is fundamentally constrained by the scarcity of scalable, high-quality trajectory data. Existing data collection strategies either rely on costly and inconsistent manual annotations or on synthetic generation m… ▽ More Graphical User Interface (GUI) Agents, powered by large language and vision-language models, hold promise for enabling end-to-end automation in digital environments. However, their progress is fundamentally constrained by the scarcity of scalable, high-quality trajectory data. Existing data collection strategies either rely on costly and inconsistent manual annotations or on synthetic generation methods that trade off between diversity and meaningful task coverage. To bridge this gap, we present GUI-ReWalk: a reasoning-enhanced, multi-stage framework for synthesizing realistic and diverse GUI trajectories. GUI-ReWalk begins with a stochastic exploration phase that emulates human trial-and-error behaviors, and progressively transitions into a reasoning-guided phase where inferred goals drive coherent and purposeful interactions. Moreover, it supports multi-stride task generation, enabling the construction of long-horizon workflows across multiple applications. By combining randomness for diversity with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. We further train Qwen2.5-VL-7B on the GUI-ReWalk dataset and evaluate it across multiple benchmarks, including Screenspot-Pro, OSWorld-G, UI-Vision, AndroidControl, and GUI-Odyssey. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent. These findings establish GUI-ReWalk as a scalable and data-efficient framework for advancing GUI agent research and enabling robust real-world automation. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.14589 [pdf, ps, other]

ATLANTIS: AI-driven Threat Localization, Analysis, and Triage Intelligence System

Authors: Taesoo Kim, HyungSeok Han, Soyeon Park, Dae R. Jeong, Dohyeok Kim, Dongkwan Kim, Eunsoo Kim, Jiho Kim, Joshua Wang, Kangsu Kim, Sangwoo Ji, Woosun Song, Hanqing Zhao, Andrew Chin, Gyejin Lee, Kevin Stevens, Mansour Alharthi, Yizhuo Zhai, Cen Zhang, Joonun Jang, Yeongjin Jang, Ammar Askar, Dongju Kim, Fabian Fleischer, Jeongin Cho , et al. (21 additional authors not shown)

Abstract: We present ATLANTIS, the cyber reasoning system developed by Team Atlanta that won 1st place in the Final Competition of DARPA's AI Cyber Challenge (AIxCC) at DEF CON 33 (August 2025). AIxCC (2023-2025) challenged teams to build autonomous cyber reasoning systems capable of discovering and patching vulnerabilities at the speed and scale of modern software. ATLANTIS integrates large language models… ▽ More We present ATLANTIS, the cyber reasoning system developed by Team Atlanta that won 1st place in the Final Competition of DARPA's AI Cyber Challenge (AIxCC) at DEF CON 33 (August 2025). AIxCC (2023-2025) challenged teams to build autonomous cyber reasoning systems capable of discovering and patching vulnerabilities at the speed and scale of modern software. ATLANTIS integrates large language models (LLMs) with program analysis -- combining symbolic execution, directed fuzzing, and static analysis -- to address limitations in automated vulnerability discovery and program repair. Developed by researchers at Georgia Institute of Technology, Samsung Research, KAIST, and POSTECH, the system addresses core challenges: scaling across diverse codebases from C to Java, achieving high precision while maintaining broad coverage, and producing semantically correct patches that preserve intended behavior. We detail the design philosophy, architectural decisions, and implementation strategies behind ATLANTIS, share lessons learned from pushing the boundaries of automated security when program analysis meets modern AI, and release artifacts to support reproducibility and future research. △ Less

Submitted 17 September, 2025; originally announced September 2025.

Comments: Version 1.0 (September 17, 2025). Technical Report. Team Atlanta -- 1st place in DARPA AIxCC Final Competition. Project page: https://team-atlanta.github.io/

arXiv:2509.10105 [pdf, ps, other]

VARCO-VISION-2.0 Technical Report

Authors: Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim

Abstract: We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layoutaware OCR by predicting both textual content and its spatial location. Trained with a four-stag… ▽ More We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layoutaware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model. △ Less

Submitted 15 September, 2025; v1 submitted 12 September, 2025; originally announced September 2025.

Comments: 19 pages, 1 figure, 14 tables. Technical report for VARCO-VISION-2.0, a Korean-English bilingual VLM in 14B and 1.7B variants. Key features: multi-image understanding, OCR with text localization, improved Korean capabilities

arXiv:2509.09671 [pdf, ps, other]

Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration

Authors: Sirui Xu, Yu-Wei Chao, Liuyu Bian, Arsalan Mousavian, Yu-Xiong Wang, Liang-Yan Gui, Wei Yang

Abstract: Hand-object motion-capture (MoCap) repositories offer large-scale, contact-rich demonstrations and hold promise for scaling dexterous robotic manipulation. Yet demonstration inaccuracies and embodiment gaps between human and robot hands limit the straightforward use of these data. Existing methods adopt a three-stage workflow, including retargeting, tracking, and residual correction, which often l… ▽ More Hand-object motion-capture (MoCap) repositories offer large-scale, contact-rich demonstrations and hold promise for scaling dexterous robotic manipulation. Yet demonstration inaccuracies and embodiment gaps between human and robot hands limit the straightforward use of these data. Existing methods adopt a three-stage workflow, including retargeting, tracking, and residual correction, which often leaves demonstrations underused and compound errors across stages. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale. Rather than treating demonstrations as ground truth, we use them as soft guidance. From raw trajectories, we derive adaptive spatial scopes, and train with reinforcement learning to keep the policy in-scope while minimizing control effort and accomplishing the task. This unified formulation preserves demonstration intent, enables robot-specific strategies to emerge, improves robustness to noise, and scales to large demonstration corpora. We distill the scaled tracking policy into a vision-based, skill-conditioned generative controller that encodes diverse manipulation skills in a rich latent representation, supporting generalization across objects and real-world deployment. Taken together, these contributions position Dexplore as a principled bridge that transforms imperfect demonstrations into effective training signals for dexterous manipulation. △ Less

Submitted 11 September, 2025; originally announced September 2025.

Comments: CoRL 2025

arXiv:2509.02317 [pdf, ps, other]

AI Agent Communication from Internet Architecture Perspective: Challenges and Opportunities

Authors: Chenguang Du, Chuyi Wang, Yihan Chao, Xiaohui Xie, Yong Cui

Abstract: The rapid development of AI agents leads to a surge in communication demands. Alongside this rise, a variety of frameworks and protocols emerge. While these efforts demonstrate the vitality of the field, they also highlight increasing fragmentation, with redundant innovation and siloed designs hindering cross-domain interoperability. These challenges underscore the need for a systematic perspectiv… ▽ More The rapid development of AI agents leads to a surge in communication demands. Alongside this rise, a variety of frameworks and protocols emerge. While these efforts demonstrate the vitality of the field, they also highlight increasing fragmentation, with redundant innovation and siloed designs hindering cross-domain interoperability. These challenges underscore the need for a systematic perspective to guide the development of scalable, secure, and sustainable AI agent ecosystems. To address this need, this paper provides the first systematic analysis of AI agent communication from the standpoint of Internet architecture-the most successful global-scale distributed system in history. Specifically, we distill decades of Internet evolution into five key elements that are directly relevant to agent communication: scalability, security, real-time performance, high performance, and manageability. We then use these elements to examine both the opportunities and the bottlenecks in developing robust multi-agent ecosystems. Overall, this paper bridges Internet architecture and AI agent communication for the first time, providing a new lens for guiding the sustainable growth of AI agent communication ecosystems. △ Less

Submitted 2 September, 2025; originally announced September 2025.

Comments: Work in Progress

arXiv:2508.06663 [pdf, ps, other]

Transferring Social Network Knowledge from Multiple GNN Teachers to Kolmogorov-Arnold Networks

Authors: Yuan-Hung Chao, Chia-Hsun Lu, Chih-Ya Shen

Abstract: Graph Neural Networks (GNNs) have shown strong performance on graph-structured data, but their reliance on graph connectivity often limits scalability and efficiency. Kolmogorov-Arnold Networks (KANs), a recent architecture with learnable univariate functions, offer strong nonlinear expressiveness and efficient inference. In this work, we integrate KANs into three popular GNN architectures-GAT, SG… ▽ More Graph Neural Networks (GNNs) have shown strong performance on graph-structured data, but their reliance on graph connectivity often limits scalability and efficiency. Kolmogorov-Arnold Networks (KANs), a recent architecture with learnable univariate functions, offer strong nonlinear expressiveness and efficient inference. In this work, we integrate KANs into three popular GNN architectures-GAT, SGC, and APPNP-resulting in three new models: KGAT, KSGC, and KAPPNP. We further adopt a multi-teacher knowledge amalgamation framework, where knowledge from multiple KAN-based GNNs is distilled into a graph-independent KAN student model. Experiments on benchmark datasets show that the proposed models improve node classification accuracy, and the knowledge amalgamation approach significantly boosts student model performance. Our findings highlight the potential of KANs for enhancing GNN expressiveness and for enabling efficient, graph-free inference. △ Less

Submitted 8 August, 2025; originally announced August 2025.

Comments: 6 pages, 3 tables

arXiv:2507.19040 [pdf, ps, other]

FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

Authors: Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng

Abstract: Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizin… ▽ More Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS's ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released. △ Less

Submitted 25 July, 2025; originally announced July 2025.

Comments: Accepted to Interspeech 2025. 5 pages

arXiv:2507.13097 [pdf, ps, other]

GraspGen: A Diffusion-based Framework for 6-DOF Grasping with On-Generator Training

Authors: Adithyavairavan Murali, Balakumar Sundaralingam, Yu-Wei Chao, Wentao Yuan, Jun Yamada, Mark Carlson, Fabio Ramos, Stan Birchfield, Dieter Fox, Clemens Eppner

Abstract: Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success on modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of… ▽ More Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success on modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a DiffusionTransformer architecture that enhances grasp generation, paired with an efficient discriminator to score and filter sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen to both objects and grippers, we release a new simulated dataset consisting of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulations with singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench grasping benchmark, and performs well on a real robot with noisy visual observations. △ Less

Submitted 17 July, 2025; originally announced July 2025.

arXiv:2507.11287 [pdf, ps, other]

Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

Authors: An-Lun Liu, Yu-Wei Chao, Yi-Ting Chen

Abstract: In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical… ▽ More In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and a metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance. See our project page for more details: https://hcis-lab.github.io/TOHGS/ △ Less

Submitted 15 July, 2025; originally announced July 2025.

Comments: Accepted by ICCV 2025

arXiv:2506.16853 [pdf, ps, other]

Reward-Agnostic Prompt Optimization for Text-to-Image Diffusion Models

Authors: Semin Kim, Yeonwoo Cha, Jaehoon Yoo, Seunghoon Hong

Abstract: We investigate a general approach for improving user prompts in text-to-image (T2I) diffusion models by finding prompts that maximize a reward function specified at test-time. Although diverse reward models are used for evaluating image generation, existing automated prompt engineering methods typically target specific reward configurations. Consequently, these specialized designs exhibit suboptim… ▽ More We investigate a general approach for improving user prompts in text-to-image (T2I) diffusion models by finding prompts that maximize a reward function specified at test-time. Although diverse reward models are used for evaluating image generation, existing automated prompt engineering methods typically target specific reward configurations. Consequently, these specialized designs exhibit suboptimal performance when applied to new prompt engineering scenarios involving different reward models. To address this limitation, we introduce RATTPO (Reward-Agnostic Test-Time Prompt Optimization), a flexible test-time optimization method applicable across various reward scenarios without modification. RATTPO iteratively searches for optimized prompts by querying large language models (LLMs) \textit{without} requiring reward-specific task descriptions. Instead, it uses the optimization trajectory and a novel reward-aware feedback signal (termed a "hint") as context. Empirical results demonstrate the versatility of RATTPO, effectively enhancing user prompts across diverse reward setups that assess various generation aspects, such as aesthetics, general human preference, or spatial relationships between objects. RATTPO surpasses other test-time search baselines in search efficiency, running 4.8 times faster than naive reward-agnostic test-time search baseline on average. Furthermore, with sufficient inference budget, it can achieve comparable performance to learning-based baselines that require reward-specific fine-tuning. The code is available at https://github.com/seminkim/RATTPO. △ Less

Submitted 29 September, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

Comments: 29 pages, Under review

arXiv:2506.16538 [pdf, ps, other]

Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ

Authors: Yunkee Chae, Kyogu Lee

Abstract: Residual Vector Quantization (RVQ) has become a dominant approach in neural speech and audio coding, providing high-fidelity compression. However, speech coding presents additional challenges due to real-world noise, which degrades compression efficiency. Standard codecs allocate bits uniformly, wasting bitrate on noise components that do not contribute to intelligibility. This paper introduces a… ▽ More Residual Vector Quantization (RVQ) has become a dominant approach in neural speech and audio coding, providing high-fidelity compression. However, speech coding presents additional challenges due to real-world noise, which degrades compression efficiency. Standard codecs allocate bits uniformly, wasting bitrate on noise components that do not contribute to intelligibility. This paper introduces a Variable Bitrate RVQ (VRVQ) framework for noise-robust speech coding, dynamically adjusting bitrate per frame to optimize rate-distortion trade-offs. Unlike constant bitrate (CBR) RVQ, our method prioritizes critical speech components while suppressing residual noise. Additionally, we integrate a feature denoiser to further improve noise robustness. Experimental results show that VRVQ improves rate-distortion trade-offs over conventional methods, achieving better compression efficiency and perceptual quality in noisy conditions. Samples are available at our project page: https://yoongi43.github.io/noise_robust_vrvq/. △ Less

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted to Interspeech 2025

arXiv:2506.13339 [pdf, ps, other]

NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

Authors: Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng

Abstract: This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, languag… ▽ More This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models. △ Less

Submitted 4 July, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

Comments: Accepted by Interspeech 2025 MLC-SLM challenge (5th place). System report

arXiv:2505.23305 [pdf, ps, other]

MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

Authors: Yunkee Chae, Kyogu Lee

Abstract: We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture… ▽ More We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: https://yoongi43.github.io/MGELDM_Samples/. △ Less

Submitted 29 May, 2025; originally announced May 2025.

Comments: 27 pages, 4 figures

arXiv:2505.13032 [pdf, other]

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Authors: Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu , et al. (9 additional authors not shown)

Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that… ▽ More We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area. △ Less

Submitted 19 May, 2025; originally announced May 2025.

Comments: Open-source at https://github.com/ddlBoJack/MMAR

arXiv:2505.07235 [pdf, other]

Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding

Authors: Dianwen Ng, Kun Zhou, Yi-Wen Chao, Zhiwei Xiong, Bin Ma, Eng Siong Chng

Abstract: Achieving high-fidelity audio compression while preserving perceptual quality across diverse content remains a key challenge in Neural Audio Coding (NAC). We introduce MUFFIN, a fully convolutional Neural Psychoacoustic Coding (NPC) framework that leverages psychoacoustically guided multi-band frequency reconstruction. At its core is a Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) mod… ▽ More Achieving high-fidelity audio compression while preserving perceptual quality across diverse content remains a key challenge in Neural Audio Coding (NAC). We introduce MUFFIN, a fully convolutional Neural Psychoacoustic Coding (NPC) framework that leverages psychoacoustically guided multi-band frequency reconstruction. At its core is a Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) module that allocates bitrate across frequency bands based on perceptual salience. This design enables efficient compression while disentangling speaker identity from content using distinct codebooks. MUFFIN incorporates a transformer-inspired convolutional backbone and a modified snake activation to enhance resolution in fine-grained spectral regions. Experimental results on multiple benchmarks demonstrate that MUFFIN consistently outperforms existing approaches in reconstruction quality. A high-compression variant achieves a state-of-the-art 12.5 Hz rate with minimal loss. MUFFIN also proves effective in downstream generative tasks, highlighting its promise as a token representation for integration with language models. Audio samples and code are available. △ Less

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2504.11914 [pdf, other]

AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection

Authors: Yuhao Chao, Jie Liu, Jie Tang, Gangshan Wu

Abstract: Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. W… ▽ More Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data. △ Less

Submitted 16 April, 2025; originally announced April 2025.

arXiv:2504.01959 [pdf, other]

Slot-Level Robotic Placement via Visual Imitation from Single Human Video

Authors: Dandan Shan, Kaichun Mo, Wei Yang, Yu-Wei Chao, David Fouhey, Dieter Fox, Arsalan Mousavian

Abstract: The majority of modern robot learning methods focus on learning a set of pre-defined tasks with limited or no generalization to new tasks. Extending the robot skillset to novel tasks involves gathering an extensive amount of training data for additional tasks. In this paper, we address the problem of teaching new tasks to robots using human demonstration videos for repetitive tasks (e.g., packing)… ▽ More The majority of modern robot learning methods focus on learning a set of pre-defined tasks with limited or no generalization to new tasks. Extending the robot skillset to novel tasks involves gathering an extensive amount of training data for additional tasks. In this paper, we address the problem of teaching new tasks to robots using human demonstration videos for repetitive tasks (e.g., packing). This task requires understanding the human video to identify which object is being manipulated (the pick object) and where it is being placed (the placement slot). In addition, it needs to re-identify the pick object and the placement slots during inference along with the relative poses to enable robot execution of the task. To tackle this, we propose SLeRP, a modular system that leverages several advanced visual foundation models and a novel slot-level placement detector Slot-Net, eliminating the need for expensive video demonstrations for training. We evaluate our system using a new benchmark of real-world videos. The evaluation results show that SLeRP outperforms several baselines and can be deployed on a real robot. △ Less

Submitted 2 April, 2025; originally announced April 2025.

arXiv:2502.17842 [pdf, other]

Task-Driven Semantic Quantization and Imitation Learning for Goal-Oriented Communications

Authors: Yu-Chieh Chao, Yubei Chen, Weiwei Wang, Achintha Wijesinghe, Suchinthaka Wanninayaka, Songyang Zhang, Zhi Ding

Abstract: Semantic communication marks a new paradigm shift from bit-wise data transmission to semantic information delivery for the purpose of bandwidth reduction. To more effectively carry out specialized downstream tasks at the receiver end, it is crucial to define the most critical semantic message in the data based on the task or goal-oriented features. In this work, we propose a novel goal-oriented co… ▽ More Semantic communication marks a new paradigm shift from bit-wise data transmission to semantic information delivery for the purpose of bandwidth reduction. To more effectively carry out specialized downstream tasks at the receiver end, it is crucial to define the most critical semantic message in the data based on the task or goal-oriented features. In this work, we propose a novel goal-oriented communication (GO-COM) framework, namely Goal-Oriented Semantic Variational Autoencoder (GOS-VAE), by focusing on the extraction of the semantics vital to the downstream tasks. Specifically, we adopt a Vector Quantized Variational Autoencoder (VQ-VAE) to compress media data at the transmitter side. Instead of targeting the pixel-wise image data reconstruction, we measure the quality-of-service at the receiver end based on a pre-defined task-incentivized model. Moreover, to capture the relevant semantic features in the data reconstruction, imitation learning is adopted to measure the data regeneration quality in terms of goal-oriented semantics. Our experimental results demonstrate the power of imitation learning in characterizing goal-oriented semantics and bandwidth efficiency of our proposed GOS-VAE. △ Less

Submitted 27 February, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

Comments: Accepted for publication in 2025 International Conference on Communications (IEEE ICC); 6 pages, 4 figures

Journal ref: 2025 International Conference on Communications (IEEE ICC)

arXiv:2502.13716 [pdf, other]

Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields

Authors: Taewoo Kim, Yujeong Chae, Hyun-Kurl Jang, Kuk-Jin Yoon

Abstract: Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since the event cameras are bio-inspired sensors that only encode brightness changes with a micro-second temporal resolution, several works utilized the event camera to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only event… ▽ More Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since the event cameras are bio-inspired sensors that only encode brightness changes with a micro-second temporal resolution, several works utilized the event camera to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only events or approximations, which can not consider the complex motion in real-world scenarios. In this paper, we propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation. In detail, our EIF-BiOFNet utilizes each valuable characteristic of the events and images for direct estimation of inter-frame motion fields without any approximation methods. Moreover, we develop an interactive attention-based frame synthesis network to efficiently leverage the complementary warping-based and synthesis-based features. Finally, we build a large-scale event-based VFI dataset, ERF-X170FPS, with a high frame rate, extreme motion, and dynamic textures to overcome the limitations of previous event-based VFI datasets. Extensive experimental results validate that our method shows significant performance improvement over the state-of-the-art VFI methods on various datasets. Our project pages are available at: https://github.com/intelpro/CBMNet △ Less

Submitted 19 February, 2025; originally announced February 2025.

Comments: Accepted in CVPR2023(Highlight)

arXiv:2502.05498 [pdf, other]

Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations

Authors: Larkin Liu, Kashif Rasul, Yutong Chao, Jalal Etesami

Abstract: We present a novel framework for online learning in Stackelberg general-sum games, where two agents, the leader and follower, engage in sequential turn-based interactions. At the core of this approach is a learned diffeomorphism that maps the joint action space to a smooth Riemannian manifold, referred to as the Stackelberg manifold. This mapping, facilitated by neural normalizing flows, ensures t… ▽ More We present a novel framework for online learning in Stackelberg general-sum games, where two agents, the leader and follower, engage in sequential turn-based interactions. At the core of this approach is a learned diffeomorphism that maps the joint action space to a smooth Riemannian manifold, referred to as the Stackelberg manifold. This mapping, facilitated by neural normalizing flows, ensures the formation of tractable isoplanar subspaces, enabling efficient techniques for online learning. By assuming linearity between the agents' reward functions on the Stackelberg manifold, our construct allows the application of standard bandit algorithms. We then provide a rigorous theoretical basis for regret minimization on convex manifolds and establish finite-time bounds on simple regret for learning Stackelberg equilibria. This integration of manifold learning into game theory uncovers a previously unrecognized potential for neural normalizing flows as an effective tool for multi-agent learning. We present empirical results demonstrating the effectiveness of our approach compared to standard baselines, with applications spanning domains such as cybersecurity and economic supply chain optimization. △ Less

Submitted 8 February, 2025; originally announced February 2025.

Comments: Stackelberg games. Manifold learning. Online learning

MSC Class: 91A10 ACM Class: I.2.6; I.2.11

arXiv:2501.19380 [pdf, ps, other]

Creative Problem-Solving: A Study with Blind and Low Vision Software Professionals

Authors: Karina Kohl, Yoonha Cha, Victoria Jackson, Stacy Branham, André van der Hoek, Rafael Prikladnicki

Abstract: Background: Software engineering requires both technical skills and creative problem-solving. Blind and low-vision software professionals (BLVSPs) encounter numerous workplace challenges, including inaccessible tools and collaboration hurdles with sighted colleagues. Objective: This study explores the innovative strategies employed by BLVSPs to overcome these accessibility barriers, focusing on th… ▽ More Background: Software engineering requires both technical skills and creative problem-solving. Blind and low-vision software professionals (BLVSPs) encounter numerous workplace challenges, including inaccessible tools and collaboration hurdles with sighted colleagues. Objective: This study explores the innovative strategies employed by BLVSPs to overcome these accessibility barriers, focusing on their custom solutions and the importance of supportive communities. Methodology: We conducted semi-structured interviews with 30 BLVSPs and used reflexive thematic analysis to identify key themes. Results: Findings reveal that BLVSPs are motivated to develop creative and adaptive solutions, highlighting the vital role of collaborative communities in fostering shared problem-solving. Conclusion: For BLVSPs, creative problem-solving is essential for navigating inaccessible work environments, in contrast to sighted peers, who pursue optimization. This study enhances understanding of how BLVSPs navigate accessibility challenges through innovation. △ Less

Submitted 31 January, 2025; originally announced January 2025.

Comments: Pre-print of accepted paper for CHASE 2025 (18th International Conference on Cooperative and Human Aspects of Software Engineering)

arXiv:2501.18148 [pdf, ps, other]

doi 10.1145/3706598.3713302

The Dilemma of Building Do-It-Yourself (DIY) Solutions for Workplace Accessibility

Authors: Yoonha Cha, Victoria Jackson, Karina Kohl, Rafael Prikladnicki, André van der Hoek, Stacy M. Branham

Abstract: Existing commercial and in-house software development tools are often inaccessible to Blind and Low Vision Software Professionals (BLVSPs), hindering their participation and career growth at work. Building on existing research on Do-It-Yourself (DIY) Assistive Technologies and customized tools made by programmers, we shed light on the currently unexplored intersection of how DIY tools built and us… ▽ More Existing commercial and in-house software development tools are often inaccessible to Blind and Low Vision Software Professionals (BLVSPs), hindering their participation and career growth at work. Building on existing research on Do-It-Yourself (DIY) Assistive Technologies and customized tools made by programmers, we shed light on the currently unexplored intersection of how DIY tools built and used by BLVSPs support accessible software development. Through semi-structured interviews with 30 BLVSPs, we found that such tools serve many different purposes and are driven by motivations such as desiring to maintain a professional image and a sense of dignity at work. These tools had significant impacts on workplace accessibility and revealed a need for a more centralized community for sharing tools, tips, and tricks. Based on our findings, we introduce the "Double Hacker Dilemma" and highlight a need for developing more effective peer and organizational platforms that support DIY tool sharing. △ Less

Submitted 30 January, 2025; originally announced January 2025.

Comments: 17 pages, accepted to CHI 2025

ACM Class: D.2.9; H.5.3

arXiv:2412.17839 [pdf, other]

LaMI-GO: Latent Mixture Integration for Goal-Oriented Communications Achieving High Spectrum Efficiency

Authors: Achintha Wijesinghe, Suchinthaka Wanninayaka, Weiwei Wang, Yu-Chieh Chao, Songyang Zhang, Zhi Ding

Abstract: The recent rise of semantic-style communications includes the development of goal-oriented communications (GOCOMs) remarkably efficient multimedia information transmissions. The concept of GO-COMS leverages advanced artificial intelligence (AI) tools to address the rising demand for bandwidth efficiency in applications, such as edge computing and Internet-of-Things (IoT). Unlike traditional commun… ▽ More The recent rise of semantic-style communications includes the development of goal-oriented communications (GOCOMs) remarkably efficient multimedia information transmissions. The concept of GO-COMS leverages advanced artificial intelligence (AI) tools to address the rising demand for bandwidth efficiency in applications, such as edge computing and Internet-of-Things (IoT). Unlike traditional communication systems focusing on source data accuracy, GO-COMs provide intelligent message delivery catering to the special needs critical to accomplishing downstream tasks at the receiver. In this work, we present a novel GO-COM framework, namely LaMI-GO that utilizes emerging generative AI for better quality-of-service (QoS) with ultra-high communication efficiency. Specifically, we design our LaMI-GO system backbone based on a latent diffusion model followed by a vector-quantized generative adversarial network (VQGAN) for efficient latent embedding and information representation. The system trains a common feature codebook the receiver side. Our experimental results demonstrate substantial improvement in perceptual quality, accuracy of downstream tasks, and bandwidth consumption over the state-of-the-art GOCOM systems and establish the power of our proposed LaMI-GO communication framework. △ Less

Submitted 18 December, 2024; originally announced December 2024.

Comments: Under review

arXiv:2412.16720 [pdf, other]

OpenAI o1 System Card

Authors: OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich , et al. (238 additional authors not shown)

Abstract: The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-ar… ▽ More The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations. △ Less

Submitted 21 December, 2024; originally announced December 2024.

arXiv:2411.16627 [pdf, other]

Inference-Time Policy Steering through Human Interactions

Authors: Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sundaralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez-D'Arpino, Dieter Fox, Julie Shah

Abstract: Generative policies trained with human demonstrations can autonomously accomplish multimodal, long-horizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, l… ▽ More Generative policies trained with human demonstrations can autonomously accomplish multimodal, long-horizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, leading to constraint violations or execution failures. To better align policy output with human intent without inducing out-of-distribution errors, we propose an Inference-Time Policy Steering (ITPS) framework that leverages human interactions to bias the generative sampling process, rather than fine-tuning the policy on interaction data. We evaluate ITPS across three simulated and real-world benchmarks, testing three forms of human interaction and associated alignment distance metrics. Among six sampling strategies, our proposed stochastic sampling with diffusion policy achieves the best trade-off between alignment and distribution shift. Videos are available at https://yanweiw.github.io/itps/. △ Less

Submitted 25 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

Comments: ICRA 2025

arXiv:2411.13100 [pdf, ps, other]

Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control

Authors: Yunkee Chae, Eunsik Shin, Suntae Hwang, Seungryeol Paik, Kyogu Lee

Abstract: Lyrics generation presents unique challenges, particularly in achieving precise syllable control while adhering to song form structures such as verses and choruses. Conventional line-by-line approaches often lead to unnatural phrasing, underscoring the need for more granular syllable management. We propose a framework for lyrics generation that enables multi-level syllable control at the word, phr… ▽ More Lyrics generation presents unique challenges, particularly in achieving precise syllable control while adhering to song form structures such as verses and choruses. Conventional line-by-line approaches often lead to unnatural phrasing, underscoring the need for more granular syllable management. We propose a framework for lyrics generation that enables multi-level syllable control at the word, phrase, line, and paragraph levels, aware of song form. Our approach generates complete lyrics conditioned on input text and song form, ensuring alignment with specified syllable constraints. Generated lyrics samples are available at: https://tinyurl.com/lyrics9999 △ Less

Submitted 23 June, 2025; v1 submitted 20 November, 2024; originally announced November 2024.

Comments: Accepted to Interspeech 2025

arXiv:2410.22918 [pdf, other]

Simulation-Free Training of Neural ODEs on Paired Data

Authors: Semin Kim, Jaehoon Yoo, Jinwoo Kim, Yeonwoo Cha, Saehoon Kim, Seunghoon Hong

Abstract: In this work, we investigate a method for simulation-free training of Neural Ordinary Differential Equations (NODEs) for learning deterministic mappings between paired data. Despite the analogy of NODEs as continuous-depth residual networks, their application in typical supervised learning tasks has not been popular, mainly due to the large number of function evaluations required by ODE solvers an… ▽ More In this work, we investigate a method for simulation-free training of Neural Ordinary Differential Equations (NODEs) for learning deterministic mappings between paired data. Despite the analogy of NODEs as continuous-depth residual networks, their application in typical supervised learning tasks has not been popular, mainly due to the large number of function evaluations required by ODE solvers and numerical instability in gradient estimation. To alleviate this problem, we employ the flow matching framework for simulation-free training of NODEs, which directly regresses the parameterized dynamics function to a predefined target velocity field. Contrary to generative tasks, however, we show that applying flow matching directly between paired data can often lead to an ill-defined flow that breaks the coupling of the data pairs (e.g., due to crossing trajectories). We propose a simple extension that applies flow matching in the embedding space of data pairs, where the embeddings are learned jointly with the dynamic function to ensure the validity of the flow which is also easier to learn. We demonstrate the effectiveness of our method on both regression and classification tasks, where our method outperforms existing NODEs with a significantly lower number of function evaluations. The code is available at https://github.com/seminkim/simulation-free-node. △ Less

Submitted 30 October, 2024; originally announced October 2024.

arXiv:2410.21526 [pdf, other]

Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Authors: Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

Abstract: Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can bring deficient outcomes while applying the trained model to applications. Therefore, we propos… ▽ More Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can bring deficient outcomes while applying the trained model to applications. Therefore, we proposed efficient weighted-loss approaches to align synthetic data with real-world distribution by emphasizing high-quality and diversified data generated by LLMs with using merely a little real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results showed leveraging our approaches on a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing potential solutions to effectively leveraging synthetic data from any suitable data generator for model training. △ Less

Submitted 22 March, 2025; v1 submitted 28 October, 2024; originally announced October 2024.

Comments: ICLR 2025 camera ready

arXiv:2410.21153 [pdf, other]

Synthetica: Large Scale Synthetic Data for Robot Perception

Authors: Ritvik Singh, Jingzhou Liu, Karl Van Wyk, Yu-Wei Chao, Jean-Francois Lafleche, Florian Shkurti, Nathan Ratliff, Ankur Handa

Abstract: Vision-based object detectors are a crucial basis for robotics applications as they provide valuable information about object localisation in the environment. These need to ensure high reliability in different lighting conditions, occlusions, and visual artifacts, all while running in real-time. Collecting and annotating real-world data for these networks is prohibitively time consuming and costly… ▽ More Vision-based object detectors are a crucial basis for robotics applications as they provide valuable information about object localisation in the environment. These need to ensure high reliability in different lighting conditions, occlusions, and visual artifacts, all while running in real-time. Collecting and annotating real-world data for these networks is prohibitively time consuming and costly, especially for custom assets, such as industrial objects, making it untenable for generalization to in-the-wild scenarios. To this end, we present Synthetica, a method for large-scale synthetic data generation for training robust state estimators. This paper focuses on the task of object detection, an important problem which can serve as the front-end for most state estimation problems, such as pose estimation. Leveraging data from a photorealistic ray-tracing renderer, we scale up data generation, generating 2.7 million images, to train highly accurate real-time detection transformers. We present a collection of rendering randomization and training-time data augmentation techniques conducive to robust sim-to-real performance for vision tasks. We demonstrate state-of-the-art performance on the task of object detection while having detectors that run at 50-100Hz which is 9 times faster than the prior SOTA. We further demonstrate the usefulness of our training methodology for robotics applications by showcasing a pipeline for use in the real world with custom objects for which there do not exist prior datasets. Our work highlights the importance of scaling synthetic data generation for robust sim-to-real transfer while achieving the fastest real-time inference speeds. Videos and supplementary information can be found at this URL: https://sites.google.com/view/synthetica-vision. △ Less

Submitted 28 October, 2024; originally announced October 2024.

Comments: 21 pages, 11 figures, 5 tables

arXiv:2410.11758 [pdf, other]

Latent Action Pretraining from Videos

Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo

Abstract: We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a… ▽ More We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation model. △ Less

Submitted 15 May, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

Comments: ICLR 2025 Website: https://latentactionpretraining.github.io

arXiv:2410.09342 [pdf, other]

LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models

Authors: Zihan Zhou, Chong Li, Xinyi Chen, Shuo Wang, Yu Chao, Zhili Li, Haoyu Wang, Rongqiao An, Qi Shi, Zhixing Tan, Xu Han, Xiaodong Shi, Zhiyuan Liu, Maosong Sun

Abstract: Enlarging the context window of large language models (LLMs) has become a crucial research area, particularly for applications involving extremely long texts. In this work, we propose a novel training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding. The proposed LLM$\times$MapReduce framework splits the entire docume… ▽ More Enlarging the context window of large language models (LLMs) has become a crucial research area, particularly for applications involving extremely long texts. In this work, we propose a novel training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding. The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output. The main challenge for divide-and-conquer long text processing frameworks lies in the risk of losing essential long-range information when splitting the document, which can lead the model to produce incomplete or incorrect answers based on the segmented texts. Disrupted long-range information can be classified into two categories: inter-chunk dependency and inter-chunk conflict. We design a structured information protocol to better cope with inter-chunk dependency and an in-context confidence calibration mechanism to resolve inter-chunk conflicts. Experimental results demonstrate that LLM$\times$MapReduce can outperform representative open-source and commercial long-context LLMs, and is applicable to several different models. △ Less

Submitted 11 October, 2024; originally announced October 2024.

Comments: Work in Progress. Code: https://github.com/thunlp/LLMxMapReduce

arXiv:2410.06016 [pdf, other]

Variable Bitrate Residual Vector Quantization for Audio Coding

Authors: Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao, Yuki Mitsufuji

Abstract: Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for… ▽ More Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame. Furthermore, we propose a gradient estimation method for the non-differentiable masking operation that transforms from the importance map to the binary importance mask, improving model training via a straight-through estimator. We demonstrate that the proposed training framework achieves superior results compared to the baseline method and shows further improvement when applied to the current state-of-the-art codec. △ Less

Submitted 27 April, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

Comments: ICASSP 2025 camera ready version

arXiv:2409.03086 [pdf, other]

doi 10.1145/3663548.3675622

Intersecting Liminality: Acquiring a Smartphone as a Blind or Low Vision Older Adult

Authors: Isabela Figueira, Yoonha Cha, Stacy M. Branham

Abstract: Older adults are increasingly acquiring smartphones. But acquiring smartphones can be difficult, and little is known about the particular challenges of older adults who are additionally blind or losing their vision. We shed light on the social and technical aspects of acquiring smartphones with vision loss, based on deep qualitative interviews with 22 blind or low vision (BLV) older adults aged 60… ▽ More Older adults are increasingly acquiring smartphones. But acquiring smartphones can be difficult, and little is known about the particular challenges of older adults who are additionally blind or losing their vision. We shed light on the social and technical aspects of acquiring smartphones with vision loss, based on deep qualitative interviews with 22 blind or low vision (BLV) older adults aged 60 and over. Through our grounded theory analysis, we found that BLV older adults experience liminality as they acquire smartphones and transition through re-acquiring smartphones as they become blind, and they can transition through liminality by participating in mutual aid within the blind community. We contribute the notion of "Intersecting Liminality," which explains the marginalizing experience of simultaneously transitioning through vision loss, aging, and technology acquisition. We contend that Intersecting Liminality can serve as a framework that centers the dynamic nature of disability to help our community generate a more nuanced understanding of technology acquisition and more effective assistive interventions. △ Less

Submitted 4 September, 2024; originally announced September 2024.

Comments: 14 pages, 2 figures, 2 tables, conference paper accepted to The 26th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2024)

ACM Class: H.5.0; K.4.2

arXiv:2406.08545 [pdf, other]

RVT-2: Learning Precise Manipulation from Few Demonstrations

Authors: Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, Dieter Fox

Abstract: In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks with few demonstrations and solving them precisely. Prior works, like PerAct and RVT, have studied this problem, however, they often struggle with tasks requiring high… ▽ More In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks with few demonstrations and solving them precisely. Prior works, like PerAct and RVT, have studied this problem, however, they often struggle with tasks requiring high precision. We study how to make them more effective, precise, and fast. Using a combination of architectural and system-level improvements, we propose RVT-2, a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT. RVT-2 achieves a new state-of-the-art on RLBench, improving the success rate from 65% to 82%. RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations. Visual results, code, and trained model are provided at: https://robotic-view-transformer-2.github.io/. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted to RSS 2024

arXiv:2406.06843 [pdf, other]

HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Authors: Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang

Abstract: We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGBD cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, sign… ▽ More We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGBD cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, significantly reducing the annotation time compared to manual labeling. With this system, we captured a video dataset of humans interacting with objects to perform various tasks, including simple pick-and-place actions, handovers between hands, and using objects according to their affordance, which can serve as human demonstrations for research in embodied AI and robot manipulation. Our data capture setup and annotation framework will be available for the community to use in reconstructing 3D shapes of objects and human hands and tracking their poses in videos. △ Less

Submitted 11 March, 2025; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2404.17036 [pdf, other]

doi 10.1145/3641822.3641872

Understanding the Career Mobility of Blind and Low Vision Software Professionals

Authors: Yoonha Cha, Victoria Jackson, Isabela Figueira, Stacy M. Branham, André van der Hoek

Abstract: Context: Scholars in the software engineering (SE) research community have investigated career advancement in the software industry. Research topics have included how individual and external factors can impact career mobility of software professionals, and how gender affects career advancement. However, the community has yet to look at career mobility from the lens of accessibility. Specifically,… ▽ More Context: Scholars in the software engineering (SE) research community have investigated career advancement in the software industry. Research topics have included how individual and external factors can impact career mobility of software professionals, and how gender affects career advancement. However, the community has yet to look at career mobility from the lens of accessibility. Specifically, there is a pressing need to illuminate the factors that hinder the career mobility of blind and low vision software professionals (BLVSPs). Objective: This study aims to understand aspects of the workplace that impact career mobility for BLVSPs. Methods: We interviewed 26 BLVSPs with different roles, years of experience, and industry sectors. Thematic analysis was used to identify common factors related to career mobility. Results: We found four factors that impacted the career mobility of BLVSPs: (1) technical challenges, (2) colleagues' perceptions of BLVSPs, (3) BLVSPs' own perceptions on managerial progression, and (4) BLVSPs' investment in accessibility at the workplace. Conclusion: We suggest implications for tool designers, organizations, and researchers towards fostering more accessible workplaces to support the career mobility of BLVSPs. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: 12 pages, 1 table, conference paper, 2024 ACM / IEEE 17th International Conference on Cooperative and Human Aspects of Software Engineering

ACM Class: D.2.9; H.5.3

arXiv:2404.01842 [pdf, other]

Semi-Supervised Domain Adaptation for Wildfire Detection

Authors: JooYoung Jang, Youngseo Cha, Jisu Kim, SooHyung Lee, Geonu Lee, Minkook Cho, Young Hwang, Nojun Kwak

Abstract: Recently, both the frequency and intensity of wildfires have increased worldwide, primarily due to climate change. In this paper, we propose a novel protocol for wildfire detection, leveraging semi-supervised Domain Adaptation for object detection, accompanied by a corresponding dataset designed for use by both academics and industries. Our dataset encompasses 30 times more diverse labeled scenes… ▽ More Recently, both the frequency and intensity of wildfires have increased worldwide, primarily due to climate change. In this paper, we propose a novel protocol for wildfire detection, leveraging semi-supervised Domain Adaptation for object detection, accompanied by a corresponding dataset designed for use by both academics and industries. Our dataset encompasses 30 times more diverse labeled scenes for the current largest benchmark wildfire dataset, HPWREN, and introduces a new labeling policy for wildfire detection. Inspired by CoordConv, we propose a robust baseline, Location-Aware Object Detection for Semi-Supervised Domain Adaptation (LADA), utilizing a teacher-student based framework capable of extracting translational variance features characteristic of wildfires. With only using 1% target domain labeled data, our framework significantly outperforms our source-only baseline by a notable margin of 3.8% in mean Average Precision on the HPWREN wildfire dataset. Our dataset is available at https://github.com/BloomBerry/LADA. △ Less

Submitted 2 April, 2024; originally announced April 2024.

Comments: 16 pages, 5 figures, 22 tables

arXiv:2403.06497 [pdf, other]

QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning

Authors: Jiun-Man Chen, Yu-Hsuan Chao, Yu-Jie Wang, Ming-Der Shieh, Chih-Chung Hsu, Wei-Fen Lin

Abstract: Transformer-based models have gained widespread popularity in both the computer vision (CV) and natural language processing (NLP) fields. However, significant challenges arise during post-training linear quantization, leading to noticeable reductions in inference accuracy. Our study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tunin… ▽ More Transformer-based models have gained widespread popularity in both the computer vision (CV) and natural language processing (NLP) fields. However, significant challenges arise during post-training linear quantization, leading to noticeable reductions in inference accuracy. Our study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tuning method, \textbf{QuantTune}. Firstly, our analysis revealed that, on average, 65\% of quantization errors result from the precision loss incurred by the dynamic range amplification effect of outliers across the target Transformer-based models. Secondly, \textbf{QuantTune} adjusts weights based on the deviation of outlier activations and effectively constrains the dynamic ranges of the problematic activations. As a result, it successfully mitigates the negative impact of outliers on the inference accuracy of quantized models. Lastly, \textbf{QuantTune} can be seamlessly integrated into the back-propagation pass in the fine-tuning process without requiring extra complexity in inference software and hardware design. Our approach showcases significant improvements in post-training quantization across a range of Transformer-based models, including ViT, Bert-base, and OPT. QuantTune reduces accuracy drops by 12.09\% at 8-bit quantization and 33.8\% at 7-bit compared to top calibration methods, outperforming state-of-the-art solutions by over 18.84\% across ViT models. △ Less

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2312.14401 [pdf, other]

Towards an Exploratory Visual Analytics System for Griefer Identification in MOBA Games

Authors: Zixin Chen, Shiyi Liu, Zhihua Jin, Gaoping Huang, Yang Chao, Zhenchuan Yang, Quan Li, Huamin Qu

Abstract: Multiplayer Online Battle Arenas (MOBAs) have gained a significant player base worldwide, generating over two billion US dollars in annual game revenue. However, the presence of griefers, who deliberately irritate and harass other players within the game, can have a detrimental impact on players' experience, compromising game fairness and potentially leading to the emergence of gray industries. Un… ▽ More Multiplayer Online Battle Arenas (MOBAs) have gained a significant player base worldwide, generating over two billion US dollars in annual game revenue. However, the presence of griefers, who deliberately irritate and harass other players within the game, can have a detrimental impact on players' experience, compromising game fairness and potentially leading to the emergence of gray industries. Unfortunately, the absence of a standardized criterion, and the lack of high-quality labeled and annotated data has made it challenging to detect the presence of griefers. Given the complexity of the multivariant spatiotemporal data for MOBA games, game developers heavily rely on manual review of entire game video recordings to label and annotate griefers, which is a time-consuming process. To alleviate this issue, we have collaborated with a team of game specialists to develop an interactive visual analysis interface, called GrieferLens. It overviews players' behavior analysis and synthesizes their key match events. By presenting multiple views of information, GrieferLens can help the game design team efficiently recognize and label griefers in MOBA games and build up a foundation for creating a more enjoyable and fair gameplay environment. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Comments: IEEE VIS 2023 (Poster)

arXiv:2312.04936 [pdf, other]

SKT-Hang: Hanging Everyday Objects via Object-Agnostic Semantic Keypoint Trajectory Generation

Authors: Chia-Liang Kuo, Yu-Wei Chao, Yi-Ting Chen

Abstract: We study the problem of hanging a wide range of grasped objects on diverse supporting items. Hanging objects is a ubiquitous task that is encountered in numerous aspects of our everyday lives. However, both the objects and supporting items can exhibit substantial variations in their shapes and structures, bringing two challenging issues: (1) determining the task-relevant geometric structures acros… ▽ More We study the problem of hanging a wide range of grasped objects on diverse supporting items. Hanging objects is a ubiquitous task that is encountered in numerous aspects of our everyday lives. However, both the objects and supporting items can exhibit substantial variations in their shapes and structures, bringing two challenging issues: (1) determining the task-relevant geometric structures across different objects and supporting items, and (2) identifying a robust action sequence to accommodate the shape variations of supporting items. To this end, we propose Semantic Keypoint Trajectory (SKT), an object-agnostic representation that is highly versatile and applicable to various everyday objects. We also propose Shape-conditioned Trajectory Deformation Network (SCTDN), a model that learns to generate SKT by deforming a template trajectory based on the task-relevant geometric structure features of the supporting items. We conduct extensive experiments and demonstrate substantial improvements in our framework over existing robot hanging methods in the success rate and inference time. Finally, our simulation-trained framework shows promising hanging results in the real world. For videos and supplementary materials, please visit our project webpage: https://hcis-lab.github.io/SKT-Hang/. △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2311.05599 [pdf, other]

doi 10.1109/ICRA57147.2024.10610694

SynH2R: Synthesizing Hand-Object Motions for Learning Human-to-Robot Handovers

Authors: Sammy Christen, Lan Feng, Wei Yang, Yu-Wei Chao, Otmar Hilliges, Jie Song

Abstract: Vision-based human-to-robot handover is an important and challenging task in human-robot interaction. Recent work has attempted to train robot policies by interacting with dynamic virtual humans in simulated environments, where the policies can later be transferred to the real world. However, a major bottleneck is the reliance on human motion capture data, which is expensive to acquire and difficu… ▽ More Vision-based human-to-robot handover is an important and challenging task in human-robot interaction. Recent work has attempted to train robot policies by interacting with dynamic virtual humans in simulated environments, where the policies can later be transferred to the real world. However, a major bottleneck is the reliance on human motion capture data, which is expensive to acquire and difficult to scale to arbitrary objects and human grasping motions. In this paper, we introduce a framework that can generate plausible human grasping motions suitable for training the robot. To achieve this, we propose a hand-object synthesis method that is designed to generate handover-friendly motions similar to humans. This allows us to generate synthetic training and testing data with 100x more objects than previous work. In our experiments, we show that our method trained purely with synthetic data is competitive with state-of-the-art methods that rely on real human motion data both in simulation and on a real system. In addition, we can perform evaluations on a larger scale compared to prior work. With our newly introduced test set, we show that our model can better scale to a large variety of unseen objects and human motions compared to the baselines. Project page: https://eth-ait.github.io/synthetic-handovers/ △ Less

Submitted 31 December, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

Comments: Accepted to ICRA 2024. Project page: https://eth-ait.github.io/synthetic-handovers/

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 2024, pp. 3168-3175

arXiv:2310.13969 [pdf, ps, other]

Distributed Linear Regression with Compositional Covariates

Authors: Yue Chao, Lei Huang, Xuejun Ma

Abstract: With the availability of extraordinarily huge data sets, solving the problems of distributed statistical methodology and computing for such data sets has become increasingly crucial in the big data area. In this paper, we focus on the distributed sparse penalized linear log-contrast model in massive compositional data. In particular, two distributed optimization techniques under centralized and de… ▽ More With the availability of extraordinarily huge data sets, solving the problems of distributed statistical methodology and computing for such data sets has become increasingly crucial in the big data area. In this paper, we focus on the distributed sparse penalized linear log-contrast model in massive compositional data. In particular, two distributed optimization techniques under centralized and decentralized topologies are proposed for solving the two different constrained convex optimization problems. Both two proposed algorithms are based on the frameworks of Alternating Direction Method of Multipliers (ADMM) and Coordinate Descent Method of Multipliers(CDMM, Lin et al., 2014, Biometrika). It is worth emphasizing that, in the decentralized topology, we introduce a distributed coordinate-wise descent algorithm based on Group ADMM(GADMM, Elgabli et al., 2020, Journal of Machine Learning Research) for obtaining a communication-efficient regularized estimation. Correspondingly, the convergence theories of the proposed algorithms are rigorously established under some regularity conditions. Numerical experiments on both synthetic and real data are conducted to evaluate our proposed algorithms. △ Less

Submitted 21 October, 2023; originally announced October 2023.

Comments: 35 pages,2 figures

MSC Class: 62-08 62-08 62-08 62-08 62-08 ACM Class: G.3

arXiv:2308.12599 [pdf, other]

Exploiting Time-Frequency Conformers for Music Audio Enhancement

Authors: Yunkee Chae, Junghyun Koo, Sungho Lee, Kyogu Lee

Abstract: With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the… ▽ More With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the transformation of degraded audio recordings into pristine high-quality music, has surged to augment the auditory experience. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work. △ Less

Submitted 24 August, 2023; originally announced August 2023.

Comments: Accepted by ACM Multimedia 2023

arXiv:2308.11896 [pdf, other]

Age Prediction From Face Images Via Contrastive Learning

Authors: Yeongnam Chae, Poulami Raha, Mijung Kim, Bjorn Stenger

Abstract: This paper presents a novel approach for accurately estimating age from face images, which overcomes the challenge of collecting a large dataset of individuals with the same identity at different ages. Instead, we leverage readily available face datasets of different people at different ages and aim to extract age-related features using contrastive learning. Our method emphasizes these relevant fe… ▽ More This paper presents a novel approach for accurately estimating age from face images, which overcomes the challenge of collecting a large dataset of individuals with the same identity at different ages. Instead, we leverage readily available face datasets of different people at different ages and aim to extract age-related features using contrastive learning. Our method emphasizes these relevant features while suppressing identity-related features using a combination of cosine similarity and triplet margin losses. We demonstrate the effectiveness of our proposed approach by achieving state-of-the-art performance on two public datasets, FG-NET and MORPH-II. △ Less

Submitted 22 August, 2023; originally announced August 2023.

Comments: MVA2023

arXiv:2308.09383 [pdf, other]

Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction from Events

Authors: Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, Kuk-Jin Yoon

Abstract: Recognizing objects from sparse and noisy events becomes extremely difficult when paired images and category labels do not exist. In this paper, we study label-free event-based object recognition where category labels and paired images are not available. To this end, we propose a joint formulation of object recognition and image reconstruction in a complementary manner. Our method first reconstruc… ▽ More Recognizing objects from sparse and noisy events becomes extremely difficult when paired images and category labels do not exist. In this paper, we study label-free event-based object recognition where category labels and paired images are not available. To this end, we propose a joint formulation of object recognition and image reconstruction in a complementary manner. Our method first reconstructs images from events and performs object recognition through Contrastive Language-Image Pre-training (CLIP), enabling better recognition through a rich context of images. Since the category information is essential in reconstructing images, we propose category-guided attraction loss and category-agnostic repulsion loss to bridge the textual features of predicted categories and the visual features of reconstructed images using CLIP. Moreover, we introduce a reliable data sampling strategy and local-global reconstruction consistency to boost joint learning of two tasks. To enhance the accuracy of prediction and quality of reconstruction, we also propose a prototype-based approach using unpaired images. Extensive experiments demonstrate the superiority of our method and its extensibility for zero-shot object recognition. Our project code is available at \url{https://github.com/Chohoonhee/Ev-LaFOR}. △ Less

Submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023 (Oral)

arXiv:2307.12576 [pdf, other]

Self-refining of Pseudo Labels for Music Source Separation with Noisy Labeled Data

Authors: Junghyun Koo, Yunkee Chae, Chang-Bin Jeon, Kyogu Lee

Abstract: Music source separation (MSS) faces challenges due to the limited availability of correctly-labeled individual instrument tracks. With the push to acquire larger datasets to improve MSS performance, the inevitability of encountering mislabeled individual instrument tracks becomes a significant challenge to address. This paper introduces an automated technique for refining the labels in a partially… ▽ More Music source separation (MSS) faces challenges due to the limited availability of correctly-labeled individual instrument tracks. With the push to acquire larger datasets to improve MSS performance, the inevitability of encountering mislabeled individual instrument tracks becomes a significant challenge to address. This paper introduces an automated technique for refining the labels in a partially mislabeled dataset. Our proposed self-refining technique, employed with a noisy-labeled dataset, results in only a 1% accuracy degradation in multi-label instrument recognition compared to a classifier trained on a clean-labeled dataset. The study demonstrates the importance of refining noisy-labeled data in MSS model training and shows that utilizing the refined dataset leads to comparable results derived from a clean-labeled dataset. Notably, upon only access to a noisy dataset, MSS models trained on a self-refined dataset even outperform those trained on a dataset refined with a classifier trained on clean labels. △ Less

Submitted 24 July, 2023; originally announced July 2023.

Comments: 24th International Society for Music Information Retrieval Conference (ISMIR 2023)

arXiv:2307.09699 [pdf, other]

ActorLens: Visual Analytics for High-level Actor Identification in MOBA Games

Authors: Zhihua Jin, Gaoping Huang, Zixin Chen, Shiyi Liu, Yang Chao, Zhenchuan Yang, Quan Li, Huamin Qu

Abstract: Multiplayer Online Battle Arenas (MOBAs) have garnered a substantial player base worldwide. Nevertheless, the presence of noxious players, commonly referred to as "actors", can significantly compromise game fairness by exhibiting negative behaviors that diminish their team's competitive edge. Furthermore, high-level actors tend to engage in more egregious conduct to evade detection, thereby causin… ▽ More Multiplayer Online Battle Arenas (MOBAs) have garnered a substantial player base worldwide. Nevertheless, the presence of noxious players, commonly referred to as "actors", can significantly compromise game fairness by exhibiting negative behaviors that diminish their team's competitive edge. Furthermore, high-level actors tend to engage in more egregious conduct to evade detection, thereby causing harm to the game community and necessitating their identification. To tackle this urgent concern, a partnership was formed with a team of game specialists from a prominent company to facilitate the identification and labeling of high-level actors in MOBA games. We first characterize the problem and abstract data and events from the game scene to formulate design requirements. Subsequently, ActorLens, a visual analytics system, was developed to exclude low-level actors, detect potential high-level actors, and assist users in labeling players. ActorLens furnishes an overview of players' status, summarizes behavioral patterns across three player cohorts (namely, focused players, historical matches of focused players, and matches of other players who played the same hero), and synthesizes key match events. By incorporating multiple views of information, users can proficiently recognize and label high-level actors in MOBA games. We conducted case studies and user studies to demonstrate the efficacy of the system. △ Less

Submitted 18 July, 2023; originally announced July 2023.

Comments: 15 pages, 9 figures

arXiv:2307.04577 [pdf, other]

AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System

Authors: Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, Dieter Fox

Abstract: Vision-based teleoperation offers the possibility to endow robots with human-level intelligence to physically interact with the environment, while only requiring low-cost camera sensors. However, current vision-based teleoperation systems are designed and engineered towards a particular robot model and deploy environment, which scales poorly as the pool of the robot models expands and the variety… ▽ More Vision-based teleoperation offers the possibility to endow robots with human-level intelligence to physically interact with the environment, while only requiring low-cost camera sensors. However, current vision-based teleoperation systems are designed and engineered towards a particular robot model and deploy environment, which scales poorly as the pool of the robot models expands and the variety of the operating environment increases. In this paper, we propose AnyTeleop, a unified and general teleoperation system to support multiple different arms, hands, realities, and camera configurations within a single system. Although being designed to provide great flexibility to the choice of simulators and real hardware, our system can still achieve great performance. For real-world experiments, AnyTeleop can outperform a previous system that was designed for a specific robot hardware with a higher success rate, using the same robot. For teleoperation in simulation, AnyTeleop leads to better imitation learning performance, compared with a previous system that is particularly designed for that simulator. Project page: https://yzqin.github.io/anyteleop/. △ Less

Submitted 16 May, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

Comments: http://anyteleop.com/ Robotics: Science and Systems 2023

arXiv:2307.03073 [pdf, other]

Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Authors: Jishnu Jaykumar P, Kamalesh Palanisamy, Yu-Wei Chao, Xinya Du, Yu Xiang

Abstract: We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP which utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot exampl… ▽ More We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP which utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples. The embeddings from the two encoders are used to compute the respective prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of the corresponding classes. Such alignment is beneficial for few-shot classification due to the reinforced contributions from both types of prototypes. Proto-CLIP has both training-free and fine-tuned variants. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning, as well as in the real world for robot perception. The project page is available at https://irvlutd.github.io/Proto-CLIP △ Less

Submitted 14 July, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2306.16495 [pdf]

Event Detection from Social Media Stream: Methods, Datasets and Opportunities

Authors: Quanzhi Li, Yang Chao, Dong Li, Yao Lu, Chi Zhang

Abstract: Social media streams contain large and diverse amount of information, ranging from daily-life stories to the latest global and local events and news. Twitter, especially, allows a fast spread of events happening real time, and enables individuals and organizations to stay informed of the events happening now. Event detection from social media data poses different challenges from traditional text a… ▽ More Social media streams contain large and diverse amount of information, ranging from daily-life stories to the latest global and local events and news. Twitter, especially, allows a fast spread of events happening real time, and enables individuals and organizations to stay informed of the events happening now. Event detection from social media data poses different challenges from traditional text and is a research area that has attracted much attention in recent years. In this paper, we survey a wide range of event detection methods for Twitter data stream, helping readers understand the recent development in this area. We present the datasets available to the public. Furthermore, a few research opportunities △ Less

Submitted 28 June, 2023; originally announced June 2023.

Comments: 8 pages

Showing 1–50 of 88 results for author: Chae, Y