
Showing 1–50 of 58 results for author: Sha, K

Searching in archive cs.
  1. arXiv:2506.22316  [pdf, ps, other]

    cs.CL

    Evaluating Scoring Bias in LLM-as-a-Judge

    Authors: Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

    Abstract: The remarkable performance of Large Language Models (LLMs) gives rise to "LLM-as-a-Judge", where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliabili…

    Submitted 27 June, 2025; originally announced June 2025.

  2. arXiv:2506.21591  [pdf, ps, other]

    cs.CL

    FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning

    Authors: Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang

    Abstract: Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks requiring both domain knowledge and sophisticated reasoning. Current evaluation benchmarks often fall short by not decoupling these capabilities indicators from single task performance and lack root cause analysis for task failure. To address this, we introduce FinEval-KR, a nove…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: Submitted to EMNLP 2025, 27 pages, 20 figures

  3. arXiv:2506.18096  [pdf, ps, other]

    cs.AI

    Deep Research Agents: A Systematic Examination And Roadmap

    Authors: Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, Jun Wang

    Abstract: The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of struct…

    Submitted 22 June, 2025; originally announced June 2025.

  4. arXiv:2506.17697  [pdf, ps, other]

    cs.AI

    Beyond Syntax: Action Semantics Learning for App Agents

    Authors: Bohan Tang, Dezhao Luo, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

    Abstract: The advent of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with closed LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current fine-tuni…

    Submitted 21 June, 2025; originally announced June 2025.

  5. arXiv:2506.17346  [pdf, ps, other]

    cs.CV cs.AI

    A Novel Multi-layer Task-centric and Data Quality Framework for Autonomous Driving

    Authors: Yuhan Zhou, Haihua Chen, Kewei Sha

    Abstract: The next-generation autonomous vehicles (AVs), embedded with frequent real-time decision-making, will rely heavily on a large volume of multisource and multimodal data. In real-world settings, the data quality (DQ) of different sources and modalities usually varies due to unexpected environmental factors or sensor issues. However, both researchers and practitioners in the AV field overwhelmingly c…

    Submitted 19 June, 2025; originally announced June 2025.

  6. arXiv:2506.11104  [pdf, ps, other]

    cs.CL cs.AI

    DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

    Authors: Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng

    Abstract: Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-s…

    Submitted 6 June, 2025; originally announced June 2025.

  7. arXiv:2506.06017  [pdf, ps, other]

    cs.CL

    AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search

    Authors: Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li

    Abstract: Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. However, designing high-performing agentic systems remains challenging. Existing agent search methods suffer from three major limitations: (1) an emphasis on optimizing agentic workflows while under-utilizing proven human-designed components such as memory, planning, and tool use; (2) high evaluation co…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: 20 pages

  8. arXiv:2505.21334  [pdf, ps, other]

    cs.CV

    HoliTom: Holistic Token Merging for Fast Video Large Language Models

    Authors: Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang

    Abstract: Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (ou…

    Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: version provides code link: https://github.com/cokeshao/HoliTom

  9. arXiv:2504.13936  [pdf, other]

    cs.HC cs.LG eess.SY

    ViMo: A Generative Visual GUI World Model for App Agents

    Authors: Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

    Abstract: App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal actions for complex tasks with longer steps. To address this, world models are used to predict the next GUI observation based on user actions, enabling more effectiv…

    Submitted 20 May, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: https://ai-agents-2030.github.io/ViMo/

  10. arXiv:2504.05686  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization

    Authors: Keren Shao, Ke Chen, Matthew Baas, Shlomo Dubnov

    Abstract: Robustness is critical in zero-shot singing voice conversion (SVC). This paper introduces two novel methods to strengthen the robustness of the kNN-VC framework for SVC. First, kNN-VC's core representation, WavLM, lacks harmonic emphasis, resulting in dull sounds and ringing artifacts. To address this, we leverage the bijection between WavLM, pitch contours, and spectrograms to perform additive sy…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: 5 pages, 6 figures, 1 table, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

  11. arXiv:2504.04471  [pdf, other]

    cs.CV

    VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

    Authors: Zhuo Zhi, Qiangqiang Wu, Minghe Shen, Wenbo Li, Yinchuan Li, Kun Shao, Kaiwen Zhou

    Abstract: Long video understanding has emerged as an increasingly important yet challenging task in computer vision. Agent-based approaches are gaining popularity for processing long videos, as they can handle extended sequences and integrate various tools to capture fine-grained information. However, existing methods still face several challenges: (1) they often rely solely on the reasoning ability of larg…

    Submitted 6 April, 2025; originally announced April 2025.

  12. arXiv:2503.10212  [pdf, other]

    cs.CV

    MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis

    Authors: Teng Xu, Taotao Zhou, Youjia Wang, Peng Yang, Simin Tang, Kuixiang Shao, Zifeng Tang, Yifei Liu, Xinyuan Chen, Hongshuang Wang, Xiaohui Wang, Huoqing Luo, Jingya Wang, Ji Hu, Jingyi Yu

    Abstract: Analyzing animal behavior is crucial in advancing neuroscience, yet quantifying and deciphering its intricate dynamics remains a significant challenge. Traditional machine vision approaches, despite their ability to detect spontaneous behaviors, fall short due to limited interpretability and reliance on manual labeling, which restricts the exploration of the full behavioral spectrum. Here, we intr…

    Submitted 27 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: 53 pages, 5 figures, 7 extended figures

  13. arXiv:2502.16268  [pdf, other]

    cs.CL

    ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

    Authors: Shulin Huang, Linyi Yang, Yan Song, Shuang Chen, Leyang Cui, Ziyu Wan, Qingcheng Zeng, Ying Wen, Kun Shao, Weinan Zhang, Jun Wang, Yue Zhang

    Abstract: Evaluating large language models (LLMs) poses significant challenges, particularly due to issues of data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets…

    Submitted 22 February, 2025; originally announced February 2025.

  14. arXiv:2502.07949  [pdf, other]

    cs.LG cs.AI

    Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning

    Authors: Qingyuan Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao

    Abstract: State-of-the-art (SOTA) reinforcement learning (RL) methods have enabled vision-language model (VLM) agents to learn from interaction with online environments without human supervision. However, these methods often struggle with learning inefficiencies when applied to complex, real-world decision-making tasks with sparse rewards and long-horizon dependencies. We propose a novel framework, Variatio…

    Submitted 20 May, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

  15. arXiv:2502.06395  [pdf, other]

    cs.AI

    AppVLM: A Lightweight Vision Language Model for Online App Control

    Authors: Georgios Papoudakis, Thomas Coste, Zhihao Wu, Jianye Hao, Jun Wang, Kun Shao

    Abstract: The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are compu…

    Submitted 10 February, 2025; originally announced February 2025.

  16. arXiv:2502.00687  [pdf, other]

    cs.AR eess.SY

    A Flexible Precision Scaling Deep Neural Network Accelerator with Efficient Weight Combination

    Authors: Liang Zhao, Kunming Shao, Fengshi Tian, Tim Kwang-Ting Cheng, Chi-Ying Tsui, Yi Zou

    Abstract: Deploying mixed-precision neural networks on edge devices is friendly to hardware resources and power consumption. To support fully mixed-precision neural network inference, it is necessary to design flexible hardware accelerators for continuous varying precision operations. However, the previous works have issues on hardware utilization and overhead of reconfigurable logic. In this paper, we prop…

    Submitted 2 February, 2025; originally announced February 2025.

    Comments: Accepted by 2025 IEEE International Symposium on Circuits and Systems (ISCAS)

  17. arXiv:2501.00701  [pdf, other]

    cs.LG math.DS

    ResKoopNet: Learning Koopman Representations for Complex Dynamics with Spectral Residuals

    Authors: Yuanchao Xu, Kaidi Shao, Nikos Logothetis, Zhongwei Shen

    Abstract: Analyzing the long-term behavior of high-dimensional nonlinear dynamical systems remains a significant challenge. While the Koopman operator framework provides a powerful global linearization tool, current methods for approximating its spectral components often face theoretical limitations and depend on predefined dictionaries. Residual Dynamic Mode Decomposition (ResDMD) advanced the field by int…

    Submitted 27 May, 2025; v1 submitted 31 December, 2024; originally announced January 2025.

  18. arXiv:2412.08944  [pdf, other]

    cs.SD cs.LG eess.AS

    Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew's Treatise

    Authors: Tornike Karchkhadze, Keren Shao, Shlomo Dubnov

    Abstract: This work presents a novel method for composing and improvising music inspired by Cornelius Cardew's Treatise, using AI to bridge graphic notation and musical expression. By leveraging OpenAI's ChatGPT to interpret the abstract visual elements of Treatise, we convert these graphical images into descriptive textual prompts. These prompts are then input into MusicLDM, a pre-trained latent diffusion…

    Submitted 12 December, 2024; originally announced December 2024.

    Journal ref: 2024 IEEE International Conference on Big Data (Big Data)

  19. arXiv:2411.16806  [pdf, other]

    cs.AR

    SynDCIM: A Performance-Aware Digital Computing-in-Memory Compiler with Multi-Spec-Oriented Subcircuit Synthesis

    Authors: Kunming Shao, Fengshi Tian, Xiaomeng Wang, Jiakun Zheng, Jia Chen, Jingyu He, Hui Wu, Jinbo Chen, Xihao Guan, Yi Deng, Fengbin Tu, Jie Yang, Mohamad Sawan, Tim Kwang-Ting Cheng, Chi-Ying Tsui

    Abstract: Digital Computing-in-Memory (DCIM) is an innovative technology that integrates multiply-accumulation (MAC) logic directly into memory arrays to enhance the performance of modern AI computing. However, the need for customized memory cells and logic components currently necessitates significant manual effort in DCIM design. Existing tools for facilitating DCIM macro designs struggle to optimize subc…

    Submitted 5 January, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: Accepted by 2025 Design, Automation & Test in Europe Conference & Exhibition (DATE) as a regular paper

  20. arXiv:2411.04890  [pdf, other]

    cs.AI cs.HC

    GUI Agents with Foundation Models: A Comprehensive Survey

    Authors: Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, Bin Wang, Chuhan Wu, Yasheng Wang, Ruiming Tang, Jianye Hao

    Abstract: Recent advances in foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), have facilitated the development of intelligent agents capable of performing complex tasks. By leveraging the ability of (M)LLMs to process and interpret Graphical User Interfaces (GUIs), these agents can autonomously execute user instructions, simulating human-like interac…

    Submitted 13 February, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

  21. arXiv:2411.03562  [pdf, other]

    cs.LG cs.AI

    Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

    Authors: Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang

    Abstract: We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework to enable it to dynamically process memory in a nested structure, effectively learn…

    Submitted 5 November, 2024; originally announced November 2024.

  22. arXiv:2410.17883  [pdf, other]

    cs.AI

    Lightweight Neural App Control

    Authors: Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao

    Abstract: This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones,…

    Submitted 12 February, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

    Comments: ICLR 2025 (spotlight)

  23. arXiv:2410.15164  [pdf, other]

    cs.AI

    SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

    Authors: Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao

    Abstract: Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths an…

    Submitted 31 March, 2025; v1 submitted 19 October, 2024; originally announced October 2024.

    Comments: ICLR 2025 Spotlight

  24. arXiv:2410.14803  [pdf, other]

    cs.LG cs.AI cs.DC eess.SY

    DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents

    Authors: Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao

    Abstract: On-device control agents, especially on mobile devices, are responsible for operating mobile devices to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control pre…

    Submitted 21 February, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

    Comments: Paper and Appendix, 26 pages

  25. arXiv:2409.18152  [pdf, other]

    cs.GT cs.LG math.OC

    Reinforcement Learning for Finite Space Mean-Field Type Games

    Authors: Kai Shao, Jiacheng Shen, Chijie An, Mathieu Laurière

    Abstract: Mean field type games (MFTGs) describe Nash equilibria between large coalitions: each coalition consists of a continuum of cooperative agents who maximize the average reward of their coalition while interacting non-cooperatively with a finite number of other coalitions. Although the theory has been extensively developed, we are still lacking efficient and scalable computational methods. Here, we d…

    Submitted 4 December, 2024; v1 submitted 25 September, 2024; originally announced September 2024.

  26. arXiv:2409.12899  [pdf, other]

    cs.RO

    LI-GS: Gaussian Splatting with LiDAR Incorporated for Accurate Large-Scale Reconstruction

    Authors: Changjian Jiang, Ruilan Gao, Kele Shao, Yue Wang, Rong Xiong, Yu Zhang

    Abstract: Large-scale 3D reconstruction is critical in the field of robotics, and the potential of 3D Gaussian Splatting (3DGS) for achieving accurate object-level reconstruction has been demonstrated. However, ensuring geometric accuracy in outdoor and unbounded scenes remains a significant challenge. This study introduces LI-GS, a reconstruction system that incorporates LiDAR and Gaussian Splatting to enh…

    Submitted 19 September, 2024; originally announced September 2024.

  27. arXiv:2408.10123  [pdf, other]

    cs.RO cs.CV

    Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

    Authors: Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, Laura Sevilla-Lara

    Abstract: Affordance, defined as the potential actions that an object offers, is crucial for robotic manipulation tasks. A deep understanding of affordance can lead to more intelligent AI systems. For example, such knowledge directs an agent to grasp a knife by the handle for cutting and by the blade when passing it to someone. In this paper, we present a streamlined affordance learning system that encompas…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: Project page: https://reagan1311.github.io/affgrasp

  28. arXiv:2408.00539  [pdf, other]

    cs.CL cs.AI

    Intermittent Semi-working Mask: A New Masking Paradigm for LLMs

    Authors: Mingcong Lu, Jiangcai Zhu, Wang Hao, Zheng Li, Shusheng Zhang, Kailai Shao, Chao Chen, Nan Li, Feng Wang, Xin Lu

    Abstract: Multi-turn dialogues are a key interaction method between humans and Large Language Models (LLMs); as conversations extend over multiple rounds, keeping LLMs' high generation quality and low latency is a challenge. Mainstream LLMs can be grouped into two categories based on masking strategy: causal LLM and prefix LLM. Several works have demonstrated that prefix LLMs tend to outperform causal ones…

    Submitted 1 August, 2024; originally announced August 2024.

  29. arXiv:2406.19741  [pdf, other]

    cs.RO cs.AI

    ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

    Authors: Christopher E. Mower, Yuhui Wan, Hongzhan Yu, Antoine Grosnit, Jonas Gonzalez-Billandon, Matthieu Zimmer, Jinlong Wang, Xinyu Zhang, Yao Zhao, Anbang Zhai, Puze Liu, Daniel Palenicek, Davide Tateo, Cesar Cadena, Marco Hutter, Jan Peters, Guangjian Tian, Yuzheng Zhuang, Kun Shao, Xingyue Quan, Jianye Hao, Jun Wang, Haitham Bou-Ammar

    Abstract: We present a framework for intuitive robot programming by non-experts, leveraging natural language prompts and contextual information from the Robot Operating System (ROS). Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface. Key features of the framework include: integration of ROS with an AI agent connect…

    Submitted 12 July, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

    Comments: This document contains 26 pages and 13 figures

  30. arXiv:2406.19614  [pdf, other]

    cs.LG cs.AI

    A Survey on Data Quality Dimensions and Tools for Machine Learning

    Authors: Yuhan Zhou, Fengjiao Tu, Kewei Sha, Junhua Ding, Haihua Chen

    Abstract: Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: This paper has been accepted by The 6th IEEE International Conference on Artificial Intelligence Testing (IEEE AITest 2024) as an invited paper

  31. arXiv:2406.16968  [pdf, other]

    cs.LG cs.AI

    Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition

    Authors: Kai Shao, Rui Wang, Yixue Hao, Long Hu, Min Chen, Hans Arno Jacobsen

    Abstract: Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal…

    Submitted 25 June, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

  32. arXiv:2405.18849  [pdf, other]

    cs.CV

    SFANet: Spatial-Frequency Attention Network for Weather Forecasting

    Authors: Jiaze Wang, Hao Chen, Hongcan Xu, Jinpeng Li, Bowen Wang, Kun Shao, Furui Liu, Huaxi Chen, Guangyong Chen, Pheng-Ann Heng

    Abstract: Weather forecasting plays a critical role in various sectors, driving decision-making and risk management. However, traditional methods often struggle to capture the complex dynamics of meteorological systems, particularly in the presence of high-resolution data. In this paper, we propose the Spatial-Frequency Attention Network (SFANet), a novel deep learning framework designed to address these ch…

    Submitted 29 May, 2024; originally announced May 2024.

  33. arXiv:2405.18679  [pdf, other]

    cs.CV

    Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain

    Authors: Juntao Zhang, Shaogeng Liu, Kun Bian, You Zhou, Pei Zhang, Wenbo An, Jun Zhou, Kun Shao

    Abstract: In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences such as language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transfor…

    Submitted 7 January, 2025; v1 submitted 28 May, 2024; originally announced May 2024.

  34. arXiv:2404.11116  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    Music Enhancement with Deep Filters: A Technical Report for The ICASSP 2024 Cadenza Challenge

    Authors: Keren Shao, Ke Chen, Shlomo Dubnov

    Abstract: In this challenge, we disentangle the deep filters from the original DeepfilterNet and incorporate them into our Spec-UNet-based network to further improve a hybrid Demucs (hdemucs) based remixing pipeline. The motivation behind the use of the deep filter component lies at its potential in better handling temporal fine structures. We demonstrate an incremental improvement in both the Signal-to-Dis…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: 2 pages, 2 figures, 1 table, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024

  35. arXiv:2402.06570  [pdf, other]

    cs.LG cs.RO

    Distilling Morphology-Conditioned Hypernetworks for Efficient Universal Morphology Control

    Authors: Zheng Xiong, Risto Vuorio, Jacob Beck, Matthieu Zimmer, Kun Shao, Shimon Whiteson

    Abstract: Learning a universal policy across different robot morphologies can significantly improve learning efficiency and enable zero-shot generalization to unseen morphologies. However, learning a highly performant universal policy requires sophisticated architectures like transformers (TF) that have larger memory and computational cost than simpler multi-layer perceptrons (MLP). To achieve both good per…

    Submitted 3 June, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: ICML 2024

  36. arXiv:2312.14878  [pdf, other]

    cs.AI cs.LG

    Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning

    Authors: Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, Zheng Xiong, Yicheng Luo, Jianye Hao, Kun Shao, Haitham Bou-Ammar, Jun Wang

    Abstract: A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL). However, constructing a standalone RL policy that maps perception to action directly encounters severe problems, chief among them being its lack of generality across multiple tasks and the need for a large amount of training data. The leading cause is that it cannot effectively integrate prior information…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: paper and appendix, 27 pages

  37. arXiv:2312.11063  [pdf, ps, other]

    cs.GT cs.AI cs.DS cs.LG econ.TH

    A survey on algorithms for Nash equilibria in finite normal-form games

    Authors: Hanyu Li, Wenhan Huang, Zhijian Duan, David Henry Mguni, Kun Shao, Jun Wang, Xiaotie Deng

    Abstract: Nash equilibrium is one of the most influential solution concepts in game theory. With the development of computer science and artificial intelligence, there is an increasing demand on Nash equilibrium computation, especially for Internet economics and multi-agent learning. This paper reviews various algorithms computing the Nash equilibrium and its approximation solutions in finite normal-form ga…

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: The published version is in Computer Science Review

  38. arXiv:2311.16082  [pdf, other]

    quant-ph cs.AI cs.AR cs.ET cs.LG

    Transformer-QEC: Quantum Error Correction Code Decoding with Transferable Transformers

    Authors: Hanrui Wang, Pengyu Liu, Kevin Shao, Dantong Li, Jiaqi Gu, David Z. Pan, Yongshan Ding, Song Han

    Abstract: Quantum computing has the potential to solve problems that are intractable for classical systems, yet the high error rates in contemporary quantum devices often exceed tolerable limits for useful algorithm execution. Quantum Error Correction (QEC) mitigates this by employing redundancy, distributing quantum information across multiple data qubits and utilizing syndrome qubits to monitor their stat…

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted to ICCAD 2023, FAST ML for Science Workshop; 7 pages, 8 figures

  39. arXiv:2308.02723  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction

    Authors: Keren Shao, Ke Chen, Taylor Berg-Kirkpatrick, Shlomo Dubnov

    Abstract: In deep learning research, many melody extraction models rely on redesigning neural network architectures to improve performance. In this paper, we propose an input feature modification and a training objective modification based on two assumptions. First, harmonics in the spectrograms of audio data decay rapidly along the frequency axis. To enhance the model's sensitivity on the trailing harmonic…

    Submitted 4 August, 2023; originally announced August 2023.

    Comments: 7 pages, 4 figures, 2 tables, Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023

  40. arXiv:2306.09200  [pdf, other]

    cs.LG cs.AI

    ChessGPT: Bridging Policy Learning and Language Modeling

    Authors: Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, Jun Wang

    Abstract: When solving decision-making tasks, humans typically depend on information from two key sources: (1) Historical policy data, which provides interaction replay from the environment, and (2) Analytical insights in natural language form, exposing the invaluable thought process or strategic considerations. Despite this, the majority of preceding research focuses on only one source: they either use his…

    Submitted 21 December, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: Published as a conference article in NeurIPS 2023

  41. DropDim: A Regularization Method for Transformer Networks

    Authors: Hao Zhang, Dan Qu, Keji Shao, Xukui Yang

    Abstract: We introduce DropDim, a structured dropout method designed for regularizing the self-attention mechanism, which is a key component of the transformer. In contrast to the general dropout method, which randomly drops neurons, DropDim drops part of the embedding dimensions. In this way, the semantic information can be completely discarded. Thus, the excessive coadapting between different embedding dim…

    Submitted 20 April, 2023; originally announced April 2023.

    Journal ref: IEEE SIGNAL PROCESSING LETTERS, VOL. 29, 2022

  42. arXiv:2303.06697  [pdf, other]

    cs.CV

    Traj-MAE: Masked Autoencoders for Trajectory Prediction

    Authors: Hao Chen, Jiaze Wang, Kun Shao, Furui Liu, Jianye Hao, Chenyong Guan, Guangyong Chen, Pheng-Ann Heng

    Abstract: Trajectory prediction has been a crucial task in building a reliable autonomous driving system by anticipating possible dangers. One key issue is to generate consistent trajectory predictions without colliding. To overcome the challenge, we propose an efficient masked autoencoder for trajectory prediction (Traj-MAE) that better represents the complicated behaviors of agents in the driving environm…

    Submitted 12 March, 2023; originally announced March 2023.

  43. arXiv:2212.07648  [pdf, other]

    cs.CV cs.AI

    Relightable Neural Human Assets from Multi-view Gradient Illuminations

    Authors: Taotao Zhou, Kai He, Di Wu, Teng Xu, Qixuan Zhang, Kuixiang Shao, Wenzheng Chen, Lan Xu, Jingyi Yu

    Abstract: Human modeling and relighting are two fundamental problems in computer vision and graphics, where high-quality datasets can largely facilitate related research. However, most existing human datasets only provide multi-view human images captured under the same illumination. Although valuable for modeling tasks, they are not readily used in relighting problems. To promote research in both fields, in…

    Submitted 23 June, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: Project page: https://miaoing.github.io/RNHA

  44. arXiv:2211.05543  [pdf, other]

    cs.SD cs.LG eess.AS

    Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation

    Authors: Runbang Zhang, Yixiao Zhang, Kai Shao, Ying Shan, Gus Xia

    Abstract: In this study, we explore the representation mapping from the domain of visual arts to the domain of music, with which we can use visual arts as an effective handle to control music generation. Unlike most studies in multimodal representation learning that are purely data-driven, we adopt an analysis-by-synthesis approach that combines deep music representation learning with user studies. Such an…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. GitHub repo: https://github.com/ldzhangyx/vis2mus

  45. arXiv:2209.01054  [pdf, other]

    cs.MA cs.LG

    Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction

    Authors: Taher Jafferjee, Juliusz Ziomek, Tianpei Yang, Zipeng Dai, Jianhong Wang, Matthew Taylor, Kun Shao, Jun Wang, David Mguni

    Abstract: Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms. Despite its popularity, it suffers from a critical drawback due to its reliance on learning from a single sample of the joint-action at a given state. As agents explore and update their policies during training, these single samples may poorly rep…

    Submitted 22 June, 2023; v1 submitted 2 September, 2022; originally announced September 2022.

  46. arXiv:2207.09074  [pdf, other]

    cs.CV cs.LG

    Incremental Task Learning with Incremental Rank Updates

    Authors: Rakib Hyder, Ken Shao, Boyu Hou, Panos Markopoulos, Ashley Prater-Bennette, M. Salman Asif

    Abstract: Incremental Task Learning (ITL) is a category of continual learning that seeks to train a single network for multiple tasks (one after another), where training data for each task is only available during the training of that task. Neural networks tend to forget older tasks when they are trained for the newer tasks; this property is often known as catastrophic forgetting. To address this issue, ITL…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: Code will be available at https://github.com/CSIPlab/task-increment-rank-update.git

    Journal ref: ECCV 2022

  47. arXiv:2205.15953  [pdf, other]

    cs.LG

    Timing is Everything: Learning to Act Selectively with Costly Actions and Budgetary Constraints

    Authors: David Mguni, Aivar Sootla, Juliusz Ziomek, Oliver Slumbers, Zipeng Dai, Kun Shao, Jun Wang

    Abstract: Many real-world settings involve costs for performing actions; transaction costs in financial systems and fuel costs being common examples. In these settings, performing actions at each time step quickly accumulates costs leading to vastly suboptimal outcomes. Additionally, repeatedly acting produces wear and tear and ultimately, damage. Determining when to act is crucial for achieving su…

    Submitted 4 June, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

  48. arXiv:2204.14096  [pdf, other]

    stat.ML cs.LG q-bio.QM stat.AP

    Bayesian Information Criterion for Event-based Multi-trial Ensemble data

    Authors: Kaidi Shao, Nikos K. Logothetis, Michel Besserve

    Abstract: Transient recurring phenomena are ubiquitous in many scientific fields like neuroscience and meteorology. Time-inhomogeneous Vector Autoregressive Models (VAR) may be used to characterize peri-event system dynamics associated with such phenomena, and can be learned by exploiting multi-dimensional data gathering samples of the evolution of the system in multiple time windows comprising, each associa…

    Submitted 29 April, 2022; originally announced April 2022.

    Comments: 12 pages, 4 figures

  49. PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution

    Authors: Zhijian Liu, Haotian Tang, Shengyu Zhao, Kevin Shao, Song Han

    Abstract: 3D neural networks are widely used in real-world applications (e.g., AR/VR headsets, self-driving cars). They are required to be fast and accurate; however, limited hardware resources on edge devices make these requirements rather challenging. Previous work processes 3D data using either voxel-based or point-based neural networks, but both types of 3D models are not hardware-efficient due to the l…

    Submitted 25 April, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

    Comments: Journal extension of arXiv:1907.03739 and arXiv:2007.16100 (IEEE TPAMI, 2021). The first two authors contributed equally to this work

  50. arXiv:2201.06837  [pdf, other]

    cs.LG physics.data-an physics.geo-ph

    Landslide Susceptibility Modeling by Interpretable Neural Network

    Authors: Khaled Youssef, Kevin Shao, Seulgi Moon, Louis-Serge Bouchard

    Abstract: Landslides are notoriously difficult to predict because numerous spatially and temporally varying factors contribute to slope stability. Artificial neural networks (ANN) have been shown to improve prediction accuracy but are largely uninterpretable. Here we introduce an additive ANN optimization framework to assess landslide susceptibility, as well as dataset division and outcome interpretation te…

    Submitted 12 March, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

    Comments: 79 pages (including SI section); 8 main figures; 12 supplementary figures; 9 supplementary tables