Showing 1–50 of 117 results for author: Chai, Y

Searching in archive cs.
  1. arXiv:2506.22992  [pdf, ps, other]

    cs.AI cs.CL cs.CV

    MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

    Authors: Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor

    Abstract: The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in…

    Submitted 28 June, 2025; originally announced June 2025.

  2. arXiv:2506.14135  [pdf, ps, other]

    cs.RO cs.CV

    GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation

    Authors: Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Liangjun Xing, Hongwen Zhang, Yebin Liu

    Abstract: Accurate action inference is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature o…

    Submitted 23 June, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

    Comments: http://chaiying1.github.io/GAF.github.io/project_page/

  3. arXiv:2506.07148  [pdf, other]

    cs.CL

    Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis

    Authors: Yaping Chai, Haoran Xie, Joe S. Qin

    Abstract: Large language model (LLM) is an effective approach to addressing data scarcity in low-resource scenarios. Recent existing research designs hand-crafted prompts to guide LLM for data augmentation. We introduce a data augmentation strategy for the aspect category sentiment analysis (ACSA) task that preserves the original sentence semantics and has linguistic diversity, specifically by providing a s…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: 10 pages, 7 figures, 4 tables

  4. arXiv:2505.24848  [pdf, ps, other]

    cs.CV cs.LG

    Reading Recognition in the Wild

    Authors: Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi, Carl Ren, Mi Zhang, Yuning Chai, Richard Newcombe, Hyo Jin Kim

    Abstract: To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading…

    Submitted 5 June, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

    Comments: Project Page: https://www.projectaria.com/datasets/reading-in-the-wild/

  5. arXiv:2505.21496  [pdf, ps, other]

    cs.CL cs.CV cs.LG

    UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

    Authors: Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li

    Abstract: In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently p…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: https://github.com/Euphoria16/UI-Genie

  6. arXiv:2505.16278  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

    Authors: Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

    Abstract: End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose D…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Project Page: https://thinklab-sjtu.github.io/DriveMoE/

  7. arXiv:2505.00990  [pdf, other]

    cs.SE

    Identifying Root Cause of bugs by Capturing Changed Code Lines with Relational Graph Neural Networks

    Authors: Jiaqi Zhang, Shikai Guo, Hui Li, Chenchen Li, Yu Chai, Rong Chen

    Abstract: The Just-In-Time defect prediction model helps development teams improve software quality and efficiency by assessing whether code changes submitted by developers are likely to introduce defects in real-time, allowing timely identification of potential issues during the commit stage. However, two main challenges exist in current work due to the reality that all deleted and added lines in bug-fixin…

    Submitted 2 May, 2025; originally announced May 2025.

  8. arXiv:2504.19838  [pdf, other]

    cs.HC

    LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

    Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li

    Abstract: With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and s…

    Submitted 23 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

    Comments: 39 pages, 10 figures, 7 tables, Project Homepage: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents

  9. arXiv:2504.13805  [pdf, other]

    cs.HC

    LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark

    Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, Wenchao Meng

    Abstract: Mobile GUI agents show promise in automating tasks but face generalization challenges in diverse real-world scenarios. Traditional approaches using pre-training or fine-tuning with massive datasets struggle with the diversity of mobile applications and user-specific tasks. We propose enhancing mobile GUI agent capabilities through human demonstrations, focusing on improving performance in unseen s…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: 23 pages, 16 figures, the project resources are available at https://lgy0404.github.io/LearnAct

  10. arXiv:2504.06884  [pdf, other]

    cs.MM cs.AI cs.CV

    Audio-visual Event Localization on Portrait Mode Short Videos

    Authors: Wuyang Liu, Yi Chai, Yongpeng Yan, Yanzhen Ren

    Abstract: Audio-visual event localization (AVEL) plays a critical role in multimodal scene understanding. While existing datasets for AVEL predominantly comprise landscape-oriented long videos with clean and simple audio context, short videos have become the primary format of online video content due to the proliferation of smartphones. Short videos are characterized by portrait-oriented framing and lay…

    Submitted 9 April, 2025; originally announced April 2025.

  11. arXiv:2504.06854  [pdf, other]

    cs.DB

    TXSQL: Lock Optimizations Towards High Contented Workloads (Extended Version)

    Authors: Donghui Wang, Yuxing Chen, Chengyao Jiang, Anqun Pan, Wei Jiang, Songli Wang, Hailin Lei, Chong Zhu, Lixiong Zheng, Wei Lu, Yunpeng Chai, Feng Zhang, Xiaoyong Du

    Abstract: Two-phase locking (2PL) is a fundamental and widely used concurrency control protocol. It regulates concurrent access to database data by following a specific sequence of acquiring and releasing locks during transaction execution, thereby ensuring transaction isolation. However, in strict 2PL, transactions must wait for conflicting transactions to commit and release their locks, which reduces conc…

    Submitted 9 April, 2025; originally announced April 2025.

  12. arXiv:2504.01308  [pdf, other]

    cs.CV

    Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

    Authors: Jiawei Wang, Yushen Zuo, Yuanjun Chai, Zhendong Liu, Yicheng Fu, Yichun Feng, Kin-Man Lam

    Abstract: Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this…

    Submitted 6 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

  13. arXiv:2503.21620  [pdf, other]

    cs.AI

    UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

    Authors: Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, Hongsheng Li

    Abstract: The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphic user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL ca…

    Submitted 24 May, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

    Comments: Updated UI-R1-E-3B

  14. arXiv:2503.06995  [pdf, other]

    cs.RO

    Physics-informed Neural Network Predictive Control for Quadruped Locomotion

    Authors: Haolin Li, Yikang Chai, Bailin Lv, Lecheng Ruan, Hang Zhao, Ye Zhao, Jianwen Luo

    Abstract: This study introduces a unified control framework that addresses the challenge of precise quadruped locomotion with unknown payloads, named as online payload identification-based physics-informed neural network predictive control (OPI-PINNPC). By integrating online payload identification with physics-informed neural networks (PINNs), our approach embeds identified mass parameters directly into the…

    Submitted 10 March, 2025; originally announced March 2025.

  15. arXiv:2502.01980  [pdf, ps, other]

    cs.LG cs.AI

    Generative Data Mining with Longtail-Guided Diffusion

    Authors: David S. Hayden, Mao Ye, Timur Garipov, Gregory P. Meyer, Carl Vondrick, Zhao Chen, Yuning Chai, Eric Wolff, Siddhartha S. Srinivasa

    Abstract: It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiab…

    Submitted 26 June, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: 20 pages

    Journal ref: Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

  16. arXiv:2501.18845  [pdf, other]

    cs.CL

    Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities

    Authors: Yaping Chai, Haoran Xie, Joe S. Qin

    Abstract: The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could unexpectedly make the model overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation cap…

    Submitted 30 January, 2025; originally announced January 2025.

    Comments: 20 pages, 4 figures, 4 tables

  17. arXiv:2501.14179  [pdf, other]

    cs.HC

    AI Chatbots as Professional Service Agents: Developing a Professional Identity

    Authors: Wenwen Li, Kangwei Shi, Yidong Chai

    Abstract: With the rapid expansion of large language model (LLM) applications, there is an emerging shift in the role of LLM-based AI chatbots from serving merely as general inquiry tools to acting as professional service agents. However, current studies often overlook a critical aspect of professional service agents: the act of communicating in a manner consistent with their professional identities. This i…

    Submitted 23 January, 2025; originally announced January 2025.

  18. arXiv:2501.11463  [pdf, ps, other]

    cs.CL

    Curiosity-Driven Reinforcement Learning from Human Feedback

    Authors: Haoran Sun, Yekun Chai, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

    Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF)…

    Submitted 31 May, 2025; v1 submitted 20 January, 2025; originally announced January 2025.

    Comments: ACL 2025

  19. arXiv:2501.07139  [pdf, other]

    cs.AI cs.PF

    FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices

    Authors: Yuji Chai, Mujin Kwen, David Brooks, Gu-Yeon Wei

    Abstract: Deploying LLMs on edge devices presents serious technical challenges. Memory elasticity is crucial for edge devices with unified memory, where memory is shared and fluctuates dynamically. Existing solutions suffer from either poor transition granularity or high storage costs. We propose FlexQuant, a novel elasticity framework that generates an ensemble of quantized models, providing an elastic hos…

    Submitted 13 January, 2025; originally announced January 2025.

  20. arXiv:2501.01149  [pdf, other]

    cs.AI

    A3: Android Agent Arena for Mobile GUI Agents

    Authors: Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guangyi Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li

    Abstract: AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static fram…

    Submitted 18 February, 2025; v1 submitted 2 January, 2025; originally announced January 2025.

  21. arXiv:2412.14415  [pdf, other]

    cs.LG cs.AI cs.CV cs.RO

    DriveGPT: Scaling Autoregressive Behavior Models for Driving

    Authors: Xin Huang, Eric M. Wolff, Paul Vernaza, Tung Phan-Minh, Hongge Chen, David S. Hayden, Mark Edmonds, Brian Pierce, Xinxin Chen, Pratik Elias Jacob, Xiaobai Chen, Chingiz Tairbekov, Pratik Agarwal, Tianshi Gao, Yuning Chai, Siddhartha Srinivasa

    Abstract: We present DriveGPT, a scalable behavior model for autonomous driving. We model driving as a sequential decision-making task, and learn a transformer model to predict future agent states as tokens in an autoregressive fashion. We scale up our model parameters and training data by multiple orders of magnitude, enabling us to explore the scaling properties in terms of dataset size, model parameters,…

    Submitted 1 May, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

    Comments: ICML 2025. 14 pages, 17 figures, 8 tables, and 1 video link

  22. arXiv:2412.06833  [pdf]

    cs.LG cs.AI cs.SI

    Detecting Fake News on Social Media: A Novel Reliability Aware Machine-Crowd Hybrid Intelligence-Based Method

    Authors: Yidong Chai, Kangwei Shi, Jiaheng Xie, Chunli Liu, Yuanchun Jiang, Yezheng Liu

    Abstract: Fake news on social media platforms poses a significant threat to societal systems, underscoring the urgent need for advanced detection methods. The existing detection methods can be divided into machine intelligence-based, crowd intelligence-based, and hybrid intelligence-based methods. Among them, hybrid intelligence-based methods achieve the best performance but fail to consider the reliability…

    Submitted 6 December, 2024; originally announced December 2024.

  23. arXiv:2412.01930  [pdf, other]

    cs.CV

    PROFIT: A Specialized Optimizer for Deep Fine Tuning

    Authors: Anirudh S Chakravarthy, Shuai Kyle Zheng, Xin Huang, Sachithra Hemachandra, Xiao Zhang, Yuning Chai, Zhao Chen

    Abstract: Fine-tuning pre-trained models has become invaluable in computer vision and robotics. Recent fine-tuning approaches focus on improving efficiency rather than accuracy by using a mixture of smaller learning rates or frozen backbones. To return the spotlight to model accuracy, we present PROFIT (Proximally Restricted Optimizer For Iterative Training), one of the first optimizers specifically designe…

    Submitted 9 December, 2024; v1 submitted 2 December, 2024; originally announced December 2024.

    Comments: technical report

  24. arXiv:2411.17126  [pdf, other]

    cs.LG

    From Machine Learning to Machine Unlearning: Complying with GDPR's Right to be Forgotten while Maintaining Business Value of Predictive Models

    Authors: Yuncong Yang, Xiao Han, Yidong Chai, Reza Ebrahimi, Rouzbeh Behnia, Balaji Padmanabhan

    Abstract: Recent privacy regulations (e.g., GDPR) grant data subjects the 'Right to Be Forgotten' (RTBF) and mandate companies to fulfill data erasure requests from data subjects. However, companies encounter great challenges in complying with the RTBF regulations, particularly when asked to erase specific training data from their well-trained predictive models. While researchers have introduced machine unl…

    Submitted 2 December, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

  25. arXiv:2411.09549  [pdf, other]

    quant-ph cs.CY

    Quantum computing inspired paintings: reinterpreting classical masterpieces

    Authors: Arianna Crippa, Yahui Chai, Omar Costa Hamido, Paulo Itaborai, Karl Jansen

    Abstract: We aim to apply a quantum computing technique to compose artworks. The main idea is to revisit three paintings of different styles and historical periods: "Narciso", painted circa 1597-1599 by Michelangelo Merisi (Caravaggio), "Les fils de l'homme", painted in 1964 by Rene Magritte and "192 Farben", painted in 1966 by Gerard Richter. We utilize the output of a quantum computation to change t…

    Submitted 6 May, 2025; v1 submitted 14 November, 2024; originally announced November 2024.

    Comments: 10 pages, 8 figures

  26. arXiv:2410.20868  [pdf, other]

    cs.IR

    RecFlow: An Industrial Full Flow Recommendation Dataset

    Authors: Qi Liu, Kai Zheng, Rui Huang, Wuchao Li, Kuo Cai, Yuan Chai, Yanan Niu, Yiqun Hui, Bing Han, Na Mou, Hongning Wang, Wentian Bao, Yunen Yu, Guorui Zhou, Han Li, Yang Song, Defu Lian, Kun Gai

    Abstract: Industrial recommendation systems (RS) rely on the multi-stage pipeline to balance effectiveness and efficiency when delivering items from a vast corpus to users. Existing RS benchmark datasets primarily focus on the exposure space, where novel RS algorithms are trained and evaluated. However, when these algorithms transition to real world industrial RS, they face a critical challenge of handling…

    Submitted 28 October, 2024; originally announced October 2024.

  27. arXiv:2410.07725  [pdf]

    cs.LG cs.NE

    Towards Trustworthy Web Attack Detection: An Uncertainty-Aware Ensemble Deep Kernel Learning Model

    Authors: Yonghang Zhou, Hongyi Zhu, Yidong Chai, Yuanchun Jiang, Yezheng Liu

    Abstract: Web attacks are one of the major and most persistent forms of cyber threats, which bring huge costs and losses to web application-based businesses. Various detection methods, such as signature-based, machine learning-based, and deep learning-based, have been proposed to identify web attacks. However, these methods either (1) heavily rely on accurate and complete rule design and feature engineering…

    Submitted 10 October, 2024; originally announced October 2024.

  28. arXiv:2410.02743  [pdf, other]

    cs.CL

    MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

    Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu

    Abstract: Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to preferred outcomes. This hinders learning efficiency and slows conve…

    Submitted 14 February, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

  29. arXiv:2409.15486  [pdf, other]

    cs.CV cs.AI

    VLMine: Long-Tail Data Mining with Vision Language Models

    Authors: Mao Ye, Gregory P. Meyer, Zaiwei Zhang, Dennis Park, Siva Karthik Mustikovela, Yuning Chai, Eric M Wolff

    Abstract: Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approa…

    Submitted 23 September, 2024; originally announced September 2024.

  30. arXiv:2409.01388  [pdf, other]

    cs.DB

    Serverless Query Processing with Flexible Performance SLAs and Prices

    Authors: Haoqiong Bian, Dongyang Geng, Yunpeng Chai, Anastasia Ailamaki

    Abstract: Serverless query processing has become increasingly popular due to its auto-scaling, high elasticity, and pay-as-you-go pricing. It allows cloud data warehouse (or lakehouse) users to focus on data analysis without the burden of managing systems and resources. Accordingly, in serverless query services, users become more concerned about cost-efficiency under acceptable performance than performance…

    Submitted 23 December, 2024; v1 submitted 2 September, 2024; originally announced September 2024.

    Comments: 9 pages, 7 figures

  31. arXiv:2408.15512  [pdf, other]

    cs.AI cs.CL physics.chem-ph

    Toward Automated Simulation Research Workflow through LLM Prompt Engineering Design

    Authors: Zhihan Liu, Yubo Chai, Jianfeng Li

    Abstract: The advent of Large Language Models (LLMs) has created new opportunities for the automation of scientific research spanning both experimental processes and computational simulations. This study explores the feasibility of constructing an autonomous simulation agent (ASA) powered by LLMs through prompt engineering and automated program design to automate the entire simulation research process accor…

    Submitted 15 January, 2025; v1 submitted 27 August, 2024; originally announced August 2024.

    Comments: The source code and example results of ASA can be found at https://github.com/zokaraa/autonomous_simulation_agent

  32. arXiv:2408.13712  [pdf, other]

    cs.CV cs.MM

    Riemann-based Multi-scale Attention Reasoning Network for Text-3D Retrieval

    Authors: Wenrui Li, Wei Han, Yandu Chen, Yeyu Chai, Yidan Lu, Xingtao Wang, Xiaopeng Fan

    Abstract: Due to the challenges in acquiring paired Text-3D data and the inherent irregularity of 3D data structures, combined representation learning of 3D point clouds and text remains unexplored. In this paper, we propose a novel Riemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D retrieval. Specifically, the extracted text and point cloud features are refined by their respective Ad…

    Submitted 12 December, 2024; v1 submitted 24 August, 2024; originally announced August 2024.

    Comments: Accepted by AAAI25

  33. arXiv:2407.17490  [pdf, other]

    cs.HC cs.AI cs.MM

    AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

    Authors: Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Shuai Ren, Hongsheng Li

    Abstract: AI agents have drawn increasing attention mostly on their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents which are capable of completing tasks by directly interacting…

    Submitted 28 May, 2025; v1 submitted 3 July, 2024; originally announced July 2024.

  34. arXiv:2407.00719  [pdf]

    cs.CR cs.DC cs.LG

    A Whole-Process Certifiably Robust Aggregation Method Against Backdoor Attacks in Federated Learning

    Authors: Anqi Zhou, Yezheng Liu, Yidong Chai, Hongyi Zhu, Xinyue Ge, Yuanchun Jiang, Meng Wang

    Abstract: Federated Learning (FL) has garnered widespread adoption across various domains such as finance, healthcare, and cybersecurity. Nonetheless, FL remains under significant threat from backdoor attacks, wherein malicious actors insert triggers into trained models, enabling them to perform certain tasks while still meeting FL's primary objectives. In response, robust aggregation methods have been prop…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 14 pages

  35. arXiv:2406.14090  [pdf, other]

    cs.AI

    Emotion-aware Personalized Music Recommendation with a Heterogeneity-aware Deep Bayesian Network

    Authors: Erkang Jing, Yezheng Liu, Yidong Chai, Shuo Yu, Longshun Liu, Yuanchun Jiang, Yang Wang

    Abstract: Music recommender systems play a critical role in music streaming platforms by providing users with music that they are likely to enjoy. Recent studies have shown that user emotions can influence users' preferences for music moods. However, existing emotion-aware music recommender systems (EMRSs) explicitly or implicitly assume that users' actual emotional states expressed through identical emotio…

    Submitted 29 November, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 43 pages, 20 figures

  36. arXiv:2406.11687  [pdf, other]

    cs.CL

    Tokenization Falling Short: On Subword Robustness in Large Language Models

    Authors: Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

    Abstract: Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens--issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptibl…

    Submitted 4 October, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: EMNLP 2024 Findings

  37. arXiv:2405.19784  [pdf]

    cs.DB cs.AI cs.DC cs.HC cs.LG

    PixelsDB: Serverless and NL-Aided Data Analytics with Flexible Service Levels and Prices

    Authors: Haoqiong Bian, Dongyang Geng, Haoyang Li, Yunpeng Chai, Anastasia Ailamaki

    Abstract: Serverless query processing has become increasingly popular due to its advantages, including automated resource management, high elasticity, and pay-as-you-go pricing. For users who are not system experts, serverless query processing greatly reduces the cost of owning a data analytic system. However, it is still a significant challenge for non-expert users to transform their complex and evolving d…

    Submitted 23 December, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

    Comments: 4 pages, 4 figures

  38. arXiv:2404.11502  [pdf, other]

    cs.CL cs.AI

    Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

    Authors: Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen

    Abstract: In real world, large language models (LLMs) can serve as the assistant to help users accomplish their jobs, and also support the development of advanced applications. For the wide application of LLMs, the inference efficiency is an essential concern, which has been widely studied in existing work, and numerous optimization algorithms and code libraries have been proposed to improve it. Nonetheless…

    Submitted 17 April, 2024; originally announced April 2024.

  39. arXiv:2404.10710  [pdf, other]

    cs.CL cs.CV

    Autoregressive Pre-Training on Pixels and Texts

    Authors: Yekun Chai, Qingyi Liu, Jingwu Xiao, Shuohuan Wang, Yu Sun, Hua Wu

    Abstract: The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a reg…

    Submitted 3 October, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: EMNLP 2024

  40. arXiv:2404.07840  [pdf, other]

    cs.CL cs.LG

    On Training Data Influence of GPT Models

    Authors: Yekun Chai, Qingyi Liu, Shuohuan Wang, Yu Sun, Qiwei Peng, Hua Wu

    Abstract: Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instance…

    Submitted 3 October, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

    Comments: EMNLP 2024

  41. arXiv:2404.03659  [pdf, other]

    cs.LG cs.CR

    Federated Unlearning for Human Activity Recognition

    Authors: Kongyang Chen, Dongping zhang, Yaping Chai, Weibin Zhang, Shaowei Wang, Jiaxing Shen

    Abstract: The rapid evolution of Internet of Things (IoT) technology has spurred the widespread adoption of Human Activity Recognition (HAR) in various daily life domains. Federated Learning (FL) is frequently utilized to build a global HAR model by aggregating user contributions without transmitting raw individual data. Despite substantial progress in user privacy protection with FL, challenges persist. Re…

    Submitted 17 January, 2024; originally announced April 2024.

  42. arXiv:2404.00399  [pdf, other]

    cs.CL cs.AI cs.LG

    Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

    Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, et al. (20 additional authors not shown)

    Abstract: Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting dur…

    Submitted 26 December, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Preprint

  43. arXiv:2402.19173  [pdf, other]

    cs.SE cs.AI

    StarCoder 2 and The Stack v2: The Next Generation

    Authors: Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, et al. (41 additional authors not shown)

    Abstract: The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data…

    Submitted 29 February, 2024; originally announced February 2024.

  44. arXiv:2402.16694  [pdf, other]

    cs.CL cs.PL cs.SE

    HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

    Authors: Qiwei Peng, Yekun Chai, Xuhong Li

    Abstract: Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap…

    Submitted 24 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: LREC-COLING 2024

  45. arXiv:2402.15583  [pdf, other]

    cs.CV cs.LG

    Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving

    Authors: Yichen Xie, Hongge Chen, Gregory P. Meyer, Yong Jae Lee, Eric M. Wolff, Masayoshi Tomizuka, Wei Zhan, Yuning Chai, Xin Huang

    Abstract: Due to the lack of depth cues in images, multi-frame inputs are important for the success of vision-based perception, prediction, and planning in autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D image inputs if we can identify the same instance in different input frames. However, the dynamic nature of autonomous driving scenes leads to signific…

    Submitted 23 February, 2024; originally announced February 2024.

  46. arXiv:2402.10045  [pdf]

    cs.CV cs.LG

    Short-Form Videos and Mental Health: A Knowledge-Guided Neural Topic Model

    Authors: Jiaheng Xie, Ruicheng Liang, Yidong Chai, Yang Liu, Daniel Zeng

    Abstract: Along with the rise of short-form videos, their mental impacts on viewers have led to widespread consequences, prompting platforms to predict videos' impact on viewers' mental health. Subsequently, they can take intervention measures according to their community guidelines. Nevertheless, applicable predictive methods lack relevance to well-established medical knowledge, which outlines clinically p…

    Submitted 12 October, 2024; v1 submitted 10 January, 2024; originally announced February 2024.

  47. arXiv:2401.12988  [pdf]

    cs.CL cs.AI

    Few-Shot Learning for Mental Disorder Detection: A Continuous Multi-Prompt Engineering Approach with Medical Knowledge Injection

    Authors: Haoxin Liu, Wenli Zhang, Jiaheng Xie, Buomsoo Kim, Zhu Zhang, Yidong Chai, Sudha Ram

    Abstract: This study harnesses state-of-the-art AI technology for detecting mental disorders through user-generated textual content. Existing studies typically rely on fully supervised machine learning, which presents challenges such as the labor-intensive manual process of annotating extensive training data for each research problem and the need to design specialized deep learning architectures for each ta…

    Submitted 13 March, 2025; v1 submitted 16 January, 2024; originally announced January 2024.

    MSC Class: K.5; ACM Class: I.2.7; H.4.m

  48. arXiv:2312.11276  [pdf, other]

    cs.CL

    Compositional Generalization for Multi-label Text Classification: A Data-Augmentation Approach

    Authors: Yuyang Chai, Zhuang Li, Jiahui Liu, Lei Chen, Fei Li, Donghong Ji, Chong Teng

    Abstract: Despite significant advancements in multi-label text classification, the ability of existing models to generalize to novel and seldom-encountered complex concepts, which are compositions of elementary ones, remains underexplored. This research addresses this gap. By creating unique data splits across three benchmarks, we assess the compositional generalization ability of existing multi-label text…

    Submitted 20 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI'24

  49. arXiv:2312.00784  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

    Authors: Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee

    Abstract: While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual…

    Submitted 26 April, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR2024. Project page: https://vip-llava.github.io/

  50. arXiv:2310.01045  [pdf, other]

    cs.CL

    Tool-Augmented Reward Modeling

    Authors: Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu

    Abstract: Reward modeling (a.k.a., preference modeling) is instrumental for aligning large language models with human preferences, particularly within the context of reinforcement learning from human feedback (RLHF). While conventional reward models (RMs) have exhibited remarkable scalability, they oft struggle with fundamental functionality such as arithmetic computation, code execution, and factual lookup…

    Submitted 11 February, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 Spotlight