-
Deep Research Agents: A Systematic Examination And Roadmap
Authors:
Yuxuan Huang,
Yihang Chen,
Haozheng Zhang,
Kang Li,
Meng Fang,
Linyi Yang,
Xiaoguang Li,
Lifeng Shang,
Songcen Xu,
Jianye Hao,
Kun Shao,
Jun Wang
Abstract:
The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API-based retrieval methods with browser-based exploration. We then examine modular tool-use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematize existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single-agent and multi-agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research. A curated and continuously updated repository of DR agent research is available at: https://github.com/ai-agents-2030/awesome-deep-research-agent.
Submitted 22 June, 2025;
originally announced June 2025.
-
Beyond Syntax: Action Semantics Learning for App Agents
Authors:
Bohan Tang,
Dezhao Luo,
Jingxuan Chen,
Shaogang Gong,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
The advent of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with closed LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs addresses these limitations. However, current fine-tuning methods use a syntax learning paradigm that forces agents to reproduce the ground truth action strings exactly, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework in which the learning objective is to capture the semantics of the ground truth actions. Specifically, inspired by programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. With this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic reward that trains the App agents to generate actions aligned with the semantics of ground truth actions, even when the syntactic forms differ. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments on offline and online smartphone App operation benchmarks show that ASL significantly improves the accuracy and generalisation of App agents over existing methods.
Submitted 21 June, 2025;
originally announced June 2025.
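As a rough illustration of the distinction this abstract draws between syntactic and semantic matching, the toy sketch below scores a predicted action by the UI state it produces rather than by its string form. Everything in it is hypothetical and chosen only for illustration: the `UIState` tuple, the `apply_action` transition function, the action strings, the button geometry, and both reward functions. It is not the paper's SEE estimator.

```python
from typing import Tuple

# Hypothetical toy UI state: (current_screen, scroll_offset).
UIState = Tuple[str, int]


def apply_action(state: UIState, action: str) -> UIState:
    """Toy transition function: return the UI state reached after `action`.

    Assumed layout: a "Settings" button with id `settings_button`
    occupies the region 0 <= x < 100, 0 <= y < 50.
    """
    screen, offset = state
    parts = action.split()
    if not parts:
        return state
    if parts[0] == "click_id" and len(parts) == 2:
        if parts[1] == "settings_button":
            return ("settings", 0)
        return state
    if parts[0] == "click_xy" and len(parts) == 3:
        x, y = int(parts[1]), int(parts[2])
        if 0 <= x < 100 and 0 <= y < 50:
            return ("settings", 0)
        return state
    if parts[0] == "scroll" and len(parts) == 2:
        return (screen, offset + int(parts[1]))
    return state  # unknown actions leave the state unchanged


def syntax_reward(predicted: str, ground_truth: str) -> float:
    """Exact string match, as in a syntax learning paradigm."""
    return 1.0 if predicted == ground_truth else 0.0


def semantic_reward(state: UIState, predicted: str, ground_truth: str) -> float:
    """Reward 1.0 when both actions induce the same state transition."""
    same = apply_action(state, predicted) == apply_action(state, ground_truth)
    return 1.0 if same else 0.0


start: UIState = ("home", 0)
truth = "click_id settings_button"
prediction = "click_xy 40 20"  # different string, same effect in this toy UI
print(syntax_reward(prediction, truth), semantic_reward(start, prediction, truth))
```

In this toy setting the coordinate click and the id click differ as strings but lead to the same screen, so only the transition-based score treats them as equivalent.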
-
A Novel Multi-layer Task-centric and Data Quality Framework for Autonomous Driving
Authors:
Yuhan Zhou,
Haihua Chen,
Kewei Sha
Abstract:
The next-generation autonomous vehicles (AVs), embedded with frequent real-time decision-making, will rely heavily on a large volume of multisource and multimodal data. In real-world settings, the data quality (DQ) of different sources and modalities usually varies due to unexpected environmental factors or sensor issues. However, both researchers and practitioners in the AV field overwhelmingly concentrate on models/algorithms while undervaluing the DQ. To fulfill the needs of the next-generation AVs with guarantees of functionality, efficiency, and trustworthiness, this paper proposes a novel task-centric, data-quality-based framework which consists of five layers: data layer, DQ layer, task layer, application layer, and goal layer. The proposed framework aims to map DQ to task requirements and performance goals. To illustrate, a case study investigating redundancy on the nuScenes dataset shows that partially removing redundancy in multisource image data could improve YOLOv8 object detection task performance. Analysis of multimodal image and LiDAR data further reveals existing redundancy DQ issues. This paper opens up a range of critical but unexplored challenges at the intersection of DQ, task orchestration, and performance-oriented system development in AVs. It is expected to guide the AV community toward building more adaptive, explainable, and resilient AVs that respond intelligently to dynamic environments and heterogeneous data streams. Code, data, and implementation details are publicly available at: https://anonymous.4open.science/r/dq4av-framework/README.md.
Submitted 19 June, 2025;
originally announced June 2025.
-
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
Authors:
Hanzhi Zhang,
Heng Fan,
Kewei Sha,
Yan Huang,
Yunhe Feng
Abstract:
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
Submitted 6 June, 2025;
originally announced June 2025.
-
AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search
Authors:
Yu Li,
Lehui Li,
Zhihao Wu,
Qingmin Liao,
Jianye Hao,
Kun Shao,
Fengli Xu,
Yong Li
Abstract:
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. However, designing high-performing agentic systems remains challenging. Existing agent search methods suffer from three major limitations: (1) an emphasis on optimizing agentic workflows while under-utilizing proven human-designed components such as memory, planning, and tool use; (2) high evaluation costs, as each newly generated agent must be fully evaluated on benchmarks; and (3) inefficient search in large search spaces. In this work, we introduce a comprehensive framework to address these challenges. First, we propose a hierarchical search space that jointly models agentic workflow and composable functional components, enabling richer agentic system designs. Building on this structured design space, we introduce a predictive value model that estimates agent performance given an agentic system and a task description, allowing for efficient, low-cost evaluation during the search process. Finally, we present a hierarchical Monte Carlo Tree Search (MCTS) strategy informed by uncertainty to guide the search. Experiments on seven benchmarks, covering embodied, math, web, tool, and game domains, show that our method achieves an average performance gain of 8.34% over state-of-the-art baselines and exhibits faster search progress with steeper improvement trajectories. The code repository is available at https://github.com/Ericccc02/AgentSwift.
Submitted 6 June, 2025;
originally announced June 2025.
-
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Authors:
Kele Shao,
Keda Tao,
Can Qin,
Haoxuan You,
Yang Sui,
Huan Wang
Abstract:
Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not fully leverage video compressibility. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom employs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method's promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational costs to 6.9% of FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLM inference.
Submitted 28 May, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
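For readers who prefer reduction factors to percentages, the figures quoted in the HoliTom abstract can be converted with simple arithmetic. The snippet below is only a unit conversion of the reported numbers; the variable names are illustrative and nothing here reproduces the method itself.

```python
# Figures as reported in the abstract.
flops_fraction = 0.069      # compute cost reduced to 6.9% of the original FLOPs
token_keep_upper = 0.10     # "over 90%" token reduction means under 10% kept

# Equivalent reduction factors.
flops_reduction_factor = 1.0 / flops_fraction          # roughly 14.5x fewer FLOPs
min_token_reduction_factor = 1.0 / token_keep_upper    # at least 10x fewer visual tokens

print(f"FLOPs reduction: about {flops_reduction_factor:.1f}x")
print(f"Visual token reduction: more than {min_token_reduction_factor:.0f}x")
```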
-
Slicing method for nonlinear integral inequalities related to critical nonlinear wave equations
Authors:
Takiko Sasaki,
Kerun Shao,
Hiroyuki Takamura
Abstract:
The so-called "slicing method" is a simple and powerful tool for showing the blow-up, as well as the optimal upper bound of the lifespan, of solutions to critical nonlinear wave equations by iteration with a logarithmic term. It has proved highly advantageous in various works on nonlinear hyperbolic PDEs. In this paper, we establish one more example in the form of a short and simple proof of the blow-up theorem for the wave equation with power-type nonlinearities of spatial derivatives of unknown functions. This method may help us to extend the blow-up result for a single nonlinear wave equation to the one for weakly coupled systems.
Submitted 27 April, 2025;
originally announced April 2025.
-
ViMo: A Generative Visual GUI World Model for App Agents
Authors:
Dezhao Luo,
Bohan Tang,
Kang Li,
Georgios Papoudakis,
Jifei Song,
Shaogang Gong,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal actions for complex tasks with longer steps. To address this, world models are used to predict the next GUI observation based on user actions, enabling more effective agent planning. However, existing world models primarily focus on generating only textual descriptions, lacking essential visual details. To fill this gap, we propose ViMo, the first visual world model designed to generate future App observations as images. For the challenge of generating text in image patches, where even minor pixel errors can distort readability, we decompose GUI generation into graphic and text content generation. We propose a novel data representation, the Symbolic Text Representation (STR), to overlay text content with symbolic placeholders while preserving graphics. With this design, ViMo employs an STR Predictor to predict future GUIs' graphics and a GUI-text Predictor for generating the corresponding text. Moreover, we deploy ViMo to enhance agent-focused tasks by predicting the outcome of different action options. Experiments show ViMo's ability to generate visually plausible and functionally effective GUIs that enable App agents to make more informed decisions.
Submitted 20 May, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization
Authors:
Keren Shao,
Ke Chen,
Matthew Baas,
Shlomo Dubnov
Abstract:
Robustness is critical in zero-shot singing voice conversion (SVC). This paper introduces two novel methods to strengthen the robustness of the kNN-VC framework for SVC. First, kNN-VC's core representation, WavLM, lacks harmonic emphasis, resulting in dull sounds and ringing artifacts. To address this, we leverage the bijection between WavLM, pitch contours, and spectrograms to perform additive synthesis, integrating the resulting waveform into the model to mitigate these issues. Second, kNN-VC overlooks concatenative smoothness, a key perceptual factor in SVC. To enhance smoothness, we propose a new distance metric that filters out unsuitable kNN candidates and optimize the summing weights of the candidates during inference. Although our techniques are built on the kNN-VC framework for implementation convenience, they are broadly applicable to general concatenative neural synthesis models. Experimental results validate the effectiveness of these modifications in achieving robust SVC. Demo: http://knnsvc.com Code: https://github.com/SmoothKen/knn-svc
Submitted 8 April, 2025;
originally announced April 2025.
-
VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
Authors:
Zhuo Zhi,
Qiangqiang Wu,
Minghe Shen,
Wenbo Li,
Yinchuan Li,
Kun Shao,
Kaiwen Zhou
Abstract:
Long video understanding has emerged as an increasingly important yet challenging task in computer vision. Agent-based approaches are gaining popularity for processing long videos, as they can handle extended sequences and integrate various tools to capture fine-grained information. However, existing methods still face several challenges: (1) they often rely solely on the reasoning ability of large language models (LLMs) without dedicated mechanisms to enhance reasoning in long video scenarios; and (2) they remain vulnerable to errors or noise from external tools. To address these issues, we propose a specialized chain-of-thought (CoT) process tailored for long video analysis. Our proposed CoT with plan-adjust mode enables the LLM to incrementally plan and adapt its information-gathering strategy. We further incorporate heuristic uncertainty estimation of both the LLM and external tools to guide the CoT process. This allows the LLM to assess the reliability of newly collected information, refine its collection strategy, and make more robust decisions when synthesizing final answers. Empirical experiments show that our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs. We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design. Evaluation on three dedicated long video benchmarks (and their subsets) demonstrates that VideoAgent2 outperforms the previous state-of-the-art agent-based method, VideoAgent, by an average of 13.1% and achieves leading performance among all zero-shot approaches.
Submitted 6 April, 2025;
originally announced April 2025.
-
MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis
Authors:
Teng Xu,
Taotao Zhou,
Youjia Wang,
Peng Yang,
Simin Tang,
Kuixiang Shao,
Zifeng Tang,
Yifei Liu,
Xinyuan Chen,
Hongshuang Wang,
Xiaohui Wang,
Huoqing Luo,
Jingya Wang,
Ji Hu,
Jingyi Yu
Abstract:
Analyzing animal behavior is crucial in advancing neuroscience, yet quantifying and deciphering its intricate dynamics remains a significant challenge. Traditional machine vision approaches, despite their ability to detect spontaneous behaviors, fall short due to limited interpretability and reliance on manual labeling, which restricts the exploration of the full behavioral spectrum. Here, we introduce MouseGPT, a Vision-Language Model (VLM) that integrates visual cues with natural language to revolutionize mouse behavior analysis. Built upon our first-of-its-kind dataset, which incorporates pose dynamics and open-vocabulary behavioral annotations across over 42 million frames of diverse psychiatric conditions, MouseGPT provides a novel, context-rich method for comprehensive behavior interpretation. Our holistic analysis framework enables detailed behavior profiling, clustering, and novel behavior discovery, offering deep insights without the need for labor-intensive manual annotation. Evaluations reveal that MouseGPT surpasses existing models in precision, adaptability, and descriptive richness, positioning it as a transformative tool for ethology and for unraveling complex behavioral dynamics in animal models.
Submitted 27 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
Authors:
Shulin Huang,
Linyi Yang,
Yan Song,
Shuang Chen,
Leyang Cui,
Ziyu Wan,
Qingcheng Zeng,
Ying Wen,
Kun Shao,
Weinan Zhang,
Jun Wang,
Yue Zhang
Abstract:
Evaluating large language models (LLMs) poses significant challenges, particularly due to issues of data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that the performance of most LLMs is far from robust and that they exhibit a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination.
Submitted 22 February, 2025;
originally announced February 2025.
-
Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning
Authors:
Qingyuan Wu,
Jianheng Liu,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
State-of-the-art (SOTA) reinforcement learning (RL) methods have enabled vision-language model (VLM) agents to learn from interaction with online environments without human supervision. However, these methods often struggle with learning inefficiencies when applied to complex, real-world decision-making tasks with sparse rewards and long-horizon dependencies. We propose a novel framework, Variational Subgoal-Conditioned Reinforcement Learning (VSC-RL), advancing the VLM agents in resolving challenging decision-making tasks. Fundamentally distinct from existing methods, VSC-RL reformulates the decision-making problem as a variational subgoal-conditioned RL problem with the newly derived optimization objective, Subgoal Evidence Lower BOund (SGC-ELBO), which comprises two key components: (a) maximizing the subgoal-conditioned return, and (b) minimizing the divergence from a reference goal-conditioned policy. We theoretically and empirically demonstrate that the VSC-RL can efficiently improve the learning efficiency without compromising performance guarantees. Across a diverse set of challenging benchmarks, including mobile device and web control tasks, VSC-RL consistently outperforms existing SOTA methods, achieving superior learning efficiency and performance.
Submitted 20 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
AppVLM: A Lightweight Vision Language Model for Online App Control
Authors:
Georgios Papoudakis,
Thomas Coste,
Zhihao Wu,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the AndroidControl dataset, compared to all evaluated baselines, and matches GPT-4o in online task completion success rate in the AndroidWorld environment, while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.
Submitted 10 February, 2025;
originally announced February 2025.
-
Data determination of HQET parameters in inclusive charm decays
Authors:
Kang-Kang Shao,
Chun Huang,
Qin Qin
Abstract:
This work delves into the phenomenology of electronic inclusive decays of $D$ mesons, encompassing $D^0, D^+, D^+_s\to Xe^{+}\nu$. The theoretical formulas for the decay widths and electron energy moments of these decays are presented as expansions with powers of $\alpha_s$ and $\Lambda_{\rm QCD}/m_c$. Remarkably, the expansion exhibits excellent convergence properties when we choose the 1S mass scheme for charm. The formulas are subsequently fitted to experimental data, and the $D$ meson matrix elements of operators in the heavy quark effective theory are hence determined by data for the first time, including
\begin{align}
\mu^2_\pi(D^{0,+}) &= (0.09\pm 0.05)\,\mathrm{GeV}^2, & \mu^2_\pi(D^{+}_s) &= (0.11\pm 0.05)\,\mathrm{GeV}^2, \nonumber \\
\mu^2_G(D^{0,+}) &= (0.32\pm 0.02)\,\mathrm{GeV}^2, & \mu^2_G(D^{+}_s) &= (0.43\pm 0.02)\,\mathrm{GeV}^2, \nonumber \\
\rho_D^3(D^{0,+}) &= (-0.003\pm 0.002)\,\mathrm{GeV}^3, & \rho_D^3(D^{+}_s) &= (-0.004\pm 0.002)\,\mathrm{GeV}^3, \nonumber \\
\rho_{LS}^3(D^{0,+}) &= (0.004\pm 0.002)\,\mathrm{GeV}^3, & \rho_{LS}^3(D^{+}_s) &= (0.005\pm 0.002)\,\mathrm{GeV}^3 . \nonumber
\end{align}
These determined parameters will play a crucial role as inputs in various physical quantities, including $D$ meson lifetimes.
Submitted 9 February, 2025;
originally announced February 2025.
-
A Flexible Precision Scaling Deep Neural Network Accelerator with Efficient Weight Combination
Authors:
Liang Zhao,
Kunming Shao,
Fengshi Tian,
Tim Kwang-Ting Cheng,
Chi-Ying Tsui,
Yi Zou
Abstract:
Deploying mixed-precision neural networks on edge devices reduces hardware resource usage and power consumption. To support fully mixed-precision neural network inference, it is necessary to design flexible hardware accelerators for continuously varying precision operations. However, previous works have issues with hardware utilization and the overhead of reconfigurable logic. In this paper, we propose an efficient accelerator for 2- to 8-bit precision scaling with serial activation input and preloaded parallel weights. First, we set two loading modes for the weight operands and decompose the weight into the corresponding bitwidths, which extends the weight precision support efficiently. Then, to improve hardware utilization of low-precision operations, we design an architecture that performs bit-serial MAC operations with systolic dataflow, in which the partial sums are combined spatially. Furthermore, we design an efficient carry-save adder tree supporting both signed and unsigned number summation across rows. The experimental results show that the proposed accelerator, synthesized with TSMC 28nm CMOS technology, achieves a peak throughput of 4.09 TOPS and a peak energy efficiency of 68.94 TOPS/W at 2/2-bit operations.
Submitted 2 February, 2025;
originally announced February 2025.
-
A Data-Driven Framework for Koopman Semigroup Estimation in Stochastic Dynamical Systems
Authors:
Yuanchao Xu,
Kaidi Shao,
Isao Ishikawa,
Yuka Hashimoto,
Nikos Logothetis,
Zhongwei Shen
Abstract:
We present Stochastic Dynamic Mode Decomposition (SDMD), a novel data-driven framework for approximating the Koopman semigroup in stochastic dynamical systems. Unlike existing methods, SDMD explicitly incorporates sampling time into its approximation, ensuring numerical stability and precision. By directly approximating the Koopman semigroup instead of the generator, SDMD avoids computationally expensive matrix exponential computations, which offers a more efficient and practical pathway for analyzing stochastic dynamics. The framework further integrates neural networks to automate basis selection, which reduces the reliance on manual intervention while maintaining computational efficiency. Rigorous theoretical guarantees, including convergence in the large data limit, zero-limit of sampling time, and large dictionary size, establish the method's reliability. Numerical experiments on canonical stochastic systems validate SDMD's effectiveness in approximating eigenvalues and eigenfunctions of the stochastic Koopman operator.
Submitted 24 May, 2025; v1 submitted 22 January, 2025;
originally announced January 2025.
-
ResKoopNet: Learning Koopman Representations for Complex Dynamics with Spectral Residuals
Authors:
Yuanchao Xu,
Kaidi Shao,
Nikos Logothetis,
Zhongwei Shen
Abstract:
Analyzing the long-term behavior of high-dimensional nonlinear dynamical systems remains a significant challenge. While the Koopman operator framework provides a powerful global linearization tool, current methods for approximating its spectral components often face theoretical limitations and depend on predefined dictionaries. Residual Dynamic Mode Decomposition (ResDMD) advanced the field by introducing the \emph{spectral residual} to assess Koopman operator approximation accuracy; however, its approach of only filtering precomputed spectra prevents the discovery of the operator's complete spectral information, a limitation known as the `spectral inclusion' problem. We introduce ResKoopNet (Residual-based Koopman-learning Network), a novel method that directly addresses this by explicitly minimizing the \emph{spectral residual} to compute Koopman eigenpairs. This enables the identification of a more precise and complete Koopman operator spectrum. Using neural networks, our approach provides theoretical guarantees while maintaining computational adaptability. Experiments on a variety of physical and biological systems show that ResKoopNet achieves more accurate spectral approximations than existing methods, particularly for high-dimensional systems and those with continuous spectra, which demonstrates its effectiveness as a tool for analyzing complex dynamical systems.
Submitted 27 May, 2025; v1 submitted 31 December, 2024;
originally announced January 2025.
-
Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew's Treatise
Authors:
Tornike Karchkhadze,
Keren Shao,
Shlomo Dubnov
Abstract:
This work presents a novel method for composing and improvising music inspired by Cornelius Cardew's Treatise, using AI to bridge graphic notation and musical expression. By leveraging OpenAI's ChatGPT to interpret the abstract visual elements of Treatise, we convert these graphical images into descriptive textual prompts. These prompts are then input into MusicLDM, a pre-trained latent diffusion model designed for music generation. We introduce a technique called "outpainting," which overlaps sections of AI-generated music to create a seamless and cohesive composition. We demonstrate a new perspective on performing and interpreting graphic scores, showing how AI can transform visual stimuli into sound and expand the creative possibilities in contemporary/experimental music composition. Musical pieces are available at https://bit.ly/TreatiseAI
Submitted 12 December, 2024;
originally announced December 2024.
-
Criteria of the existence of global solutions to semilinear wave equations with first-order derivatives on exterior domains
Authors:
Kerun Shao
Abstract:
We study the existence of global solutions to semilinear wave equations on exterior domains $\mathbb{R}^n\setminus\mathcal{K}$, $n\geq2$, with small initial data and nonlinear terms $F(\partial u)$ where $F\in C^κ$ and $\partial^{\leqκ}F(0)=0$. If $n\geq2$ and $κ>n/2$, criteria of the existence of a global solution for general initial data are provided, except for non-empty obstacles $\mathcal{K}$ when $n=2$. For $n\geq3$ and $1\leqκ\leq n/2$, we verify the criteria for radial solutions provided obstacles $\mathcal{K}$ are closed balls centered at origin. These criteria are established by local energy estimates and the weighted Sobolev embedding including trace estimates. Meanwhile, for the sample choice of the nonlinear term and initial data, sharp estimates of lifespan are obtained.
Submitted 7 December, 2024;
originally announced December 2024.
-
SynDCIM: A Performance-Aware Digital Computing-in-Memory Compiler with Multi-Spec-Oriented Subcircuit Synthesis
Authors:
Kunming Shao,
Fengshi Tian,
Xiaomeng Wang,
Jiakun Zheng,
Jia Chen,
Jingyu He,
Hui Wu,
Jinbo Chen,
Xihao Guan,
Yi Deng,
Fengbin Tu,
Jie Yang,
Mohamad Sawan,
Tim Kwang-Ting Cheng,
Chi-Ying Tsui
Abstract:
Digital Computing-in-Memory (DCIM) is an innovative technology that integrates multiply-accumulation (MAC) logic directly into memory arrays to enhance the performance of modern AI computing. However, the need for customized memory cells and logic components currently necessitates significant manual effort in DCIM design. Existing tools for facilitating DCIM macro designs struggle to optimize subcircuit synthesis to meet user-defined performance criteria, thereby limiting the potential system-level acceleration that DCIM can offer. To address these challenges and enable agile design of DCIM macros with optimal architectures, we present SynDCIM, a performance-aware DCIM compiler that employs multi-spec-oriented subcircuit synthesis. SynDCIM features an automated performance-to-layout generation process that aligns with user-defined performance expectations. This is supported by a scalable subcircuit library and a multi-spec-oriented searching algorithm for effective subcircuit synthesis. The effectiveness of SynDCIM is demonstrated through extensive experiments and validated with a test chip fabricated in a 40nm CMOS process. Testing results reveal that designs generated by SynDCIM exhibit competitive performance when compared to state-of-the-art manually designed DCIM macros.
Submitted 5 January, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
GUI Agents with Foundation Models: A Comprehensive Survey
Authors:
Shuai Wang,
Weiwen Liu,
Jingxuan Chen,
Yuqi Zhou,
Weinan Gan,
Xingshan Zeng,
Yuhan Che,
Shuai Yu,
Xinlong Hao,
Kun Shao,
Bin Wang,
Chuhan Wu,
Yasheng Wang,
Ruiming Tang,
Jianye Hao
Abstract:
Recent advances in foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), have facilitated the development of intelligent agents capable of performing complex tasks. By leveraging the ability of (M)LLMs to process and interpret Graphical User Interfaces (GUIs), these agents can autonomously execute user instructions, simulating human-like interactions such as clicking and typing. This survey consolidates recent research on (M)LLM-based GUI agents, highlighting key innovations in data resources, frameworks, and applications. We begin by reviewing representative datasets and benchmarks, followed by an overview of a generalized, unified framework that encapsulates the essential components of prior studies, supported by a detailed taxonomy. Additionally, we explore relevant commercial applications. Drawing insights from existing work, we identify key challenges and propose future research directions. We hope this survey will inspire further advancements in the field of (M)LLM-based GUI agents.
Submitted 13 February, 2025; v1 submitted 7 November, 2024;
originally announced November 2024.
-
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
Authors:
Antoine Grosnit,
Alexandre Maraval,
James Doran,
Giuseppe Paolo,
Albert Thomas,
Refinath Shahul Hameed Nabeezath Beevi,
Jonas Gonzalez,
Khyati Khandelwal,
Ignacio Iacobacci,
Abdelhakim Benechehab,
Hamza Cherkaoui,
Youssef Attia El-Hili,
Kun Shao,
Jianye Hao,
Jun Yao,
Balazs Kegl,
Haitham Bou-Ammar,
Jun Wang
Abstract:
We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework to enable it to dynamically process memory in a nested structure, effectively learning from accumulated experience stored to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's capabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5\% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38\%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.
Submitted 5 November, 2024;
originally announced November 2024.
-
Lightweight Neural App Control
Authors:
Filippos Christianos,
Georgios Papoudakis,
Thomas Coste,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
Submitted 12 February, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
Authors:
Jingxuan Chen,
Derek Yuen,
Bin Xie,
Yuhao Yang,
Gongwei Chen,
Zhihao Wu,
Li Yixing,
Xurui Zhou,
Weiwen Liu,
Shuai Wang,
Kaiwen Zhou,
Rui Shao,
Liqiang Nie,
Yasheng Wang,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-Bench offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over ten agents with the flexibility to add more; (3) A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and execution costs. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications. SPA-Bench is available at https://ai-agents-2030.github.io/SPA-Bench/.
Submitted 31 March, 2025; v1 submitted 19 October, 2024;
originally announced October 2024.
-
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
Authors:
Taiyi Wang,
Zhihao Wu,
Jianheng Liu,
Jianye Hao,
Jun Wang,
Kun Shao
Abstract:
On-device control agents, especially on mobile devices, are responsible for operating mobile devices to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and enables training data collection 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.
Submitted 21 February, 2025; v1 submitted 18 October, 2024;
originally announced October 2024.
-
Reinforcement Learning for Finite Space Mean-Field Type Games
Authors:
Kai Shao,
Jiacheng Shen,
Chijie An,
Mathieu Laurière
Abstract:
Mean field type games (MFTGs) describe Nash equilibria between large coalitions: each coalition consists of a continuum of cooperative agents who maximize the average reward of their coalition while interacting non-cooperatively with a finite number of other coalitions. Although the theory has been extensively developed, we are still lacking efficient and scalable computational methods. Here, we develop reinforcement learning methods for such games in a finite space setting with general dynamics and reward functions. We start by proving that MFTG solution yields approximate Nash equilibria in finite-size coalition games. We then propose two algorithms. The first is based on quantization of mean-field spaces and Nash Q-learning. We provide convergence and stability analysis. We then propose a deep reinforcement learning algorithm, which can scale to larger spaces. Numerical experiments in 5 environments with mean-field distributions of dimension up to $200$ show the scalability and efficiency of the proposed method.
Submitted 4 December, 2024; v1 submitted 25 September, 2024;
originally announced September 2024.
-
LI-GS: Gaussian Splatting with LiDAR Incorporated for Accurate Large-Scale Reconstruction
Authors:
Changjian Jiang,
Ruilan Gao,
Kele Shao,
Yue Wang,
Rong Xiong,
Yu Zhang
Abstract:
Large-scale 3D reconstruction is critical in the field of robotics, and the potential of 3D Gaussian Splatting (3DGS) for achieving accurate object-level reconstruction has been demonstrated. However, ensuring geometric accuracy in outdoor and unbounded scenes remains a significant challenge. This study introduces LI-GS, a reconstruction system that incorporates LiDAR and Gaussian Splatting to enhance geometric accuracy in large-scale scenes. 2D Gaussian surfels are employed as the map representation to enhance surface alignment. Additionally, a novel modeling method is proposed to convert LiDAR point clouds to plane-constrained multimodal Gaussian Mixture Models (GMMs). The GMMs are utilized during both initialization and optimization stages to ensure sufficient and continuous supervision over the entire scene while mitigating the risk of over-fitting. Furthermore, GMMs are employed in mesh extraction to eliminate artifacts and improve the overall geometric quality. Experiments demonstrate that our method outperforms state-of-the-art methods in large-scale 3D reconstruction, achieving higher accuracy compared to both LiDAR-based methods and Gaussian-based methods with improvements of 52.6% and 68.7%, respectively.
Submitted 19 September, 2024;
originally announced September 2024.
-
Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
Authors:
Gen Li,
Nikolaos Tsagkas,
Jifei Song,
Ruaridh Mon-Williams,
Sethu Vijayakumar,
Kun Shao,
Laura Sevilla-Lara
Abstract:
Affordance, defined as the potential actions that an object offers, is crucial for robotic manipulation tasks. A deep understanding of affordance can lead to more intelligent AI systems. For example, such knowledge directs an agent to grasp a knife by the handle for cutting and by the blade when passing it to someone. In this paper, we present a streamlined affordance learning system that encompasses data collection, effective model training, and robot deployment. First, we collect training data from egocentric videos in an automatic manner. Different from previous methods that focus only on the object graspable affordance and represent it as coarse heatmaps, we cover both graspable (e.g., object handles) and functional affordances (e.g., knife blades, hammer heads) and extract data with precise segmentation masks. We then propose an effective model, termed Geometry-guided Affordance Transformer (GKT), to train on the collected data. GKT integrates an innovative Depth Feature Injector (DFI) to incorporate 3D shape and geometric priors, enhancing the model's understanding of affordances. To enable affordance-oriented manipulation, we further introduce Aff-Grasp, a framework that combines GKT with a grasp generation model. For comprehensive evaluation, we create an affordance evaluation dataset with pixel-wise annotations, and design real-world tasks for robot experiments. The results show that GKT surpasses the state-of-the-art by 15.9% in mIoU, and Aff-Grasp achieves high success rates of 95.5% in affordance prediction and 77.1% in successful grasping among 179 trials, including evaluations with seen, unseen objects, and cluttered scenes.
Submitted 19 August, 2024;
originally announced August 2024.
-
Intermittent Semi-working Mask: A New Masking Paradigm for LLMs
Authors:
Mingcong Lu,
Jiangcai Zhu,
Wang Hao,
Zheng Li,
Shusheng Zhang,
Kailai Shao,
Chao Chen,
Nan Li,
Feng Wang,
Xin Lu
Abstract:
Multi-turn dialogues are a key interaction method between humans and Large Language Models (LLMs). As conversations extend over multiple rounds, keeping LLMs' high generation quality and low latency is a challenge. Mainstream LLMs can be grouped into two categories based on masking strategy: causal LLM and prefix LLM. Several works have demonstrated that prefix LLMs tend to outperform causal ones in scenarios that heavily depend on historical context such as multi-turn dialogues or in-context learning, thanks to their bidirectional attention on prefix sequences. However, prefix LLMs have an inherent inefficient training problem in multi-turn dialogue datasets. In addition, the attention mechanism of prefix LLM makes it unable to reuse Key-Value Cache (KV Cache) across dialogue rounds to reduce generation latency. In this paper, we propose a novel masking scheme called Intermittent Semi-working Mask (ISM) to address these problems. Specifically, we apply alternate bidirectional and unidirectional attention on queries and answers in the dialogue history. In this way, ISM is able to maintain the high quality of prefix LLM and low generation latency of causal LLM, simultaneously. Extensive experiments illustrate that our ISM achieves significant performance.
Submitted 1 August, 2024;
originally announced August 2024.
-
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
Authors:
Christopher E. Mower,
Yuhui Wan,
Hongzhan Yu,
Antoine Grosnit,
Jonas Gonzalez-Billandon,
Matthieu Zimmer,
Jinlong Wang,
Xinyu Zhang,
Yao Zhao,
Anbang Zhai,
Puze Liu,
Daniel Palenicek,
Davide Tateo,
Cesar Cadena,
Marco Hutter,
Jan Peters,
Guangjian Tian,
Yuzheng Zhuang,
Kun Shao,
Xingyue Quan,
Jianye Hao,
Jun Wang,
Haitham Bou-Ammar
Abstract:
We present a framework for intuitive robot programming by non-experts, leveraging natural language prompts and contextual information from the Robot Operating System (ROS). Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface. Key features of the framework include: integration of ROS with an AI agent connected to a plethora of open-source and commercial LLMs, automatic extraction of a behavior from the LLM output and execution of ROS actions/services, support for three behavior modes (sequence, behavior tree, state machine), imitation learning for adding new robot actions to the library of possible actions, and LLM reflection via human and environment feedback. Extensive experiments validate the framework, showcasing robustness, scalability, and versatility in diverse scenarios, including long-horizon tasks, tabletop rearrangements, and remote supervisory control. To facilitate the adoption of our framework and support the reproduction of our results, we have made our code open-source. You can access it at: https://github.com/huawei-noah/HEBO/tree/master/ROSLLM.
Submitted 12 July, 2024; v1 submitted 28 June, 2024;
originally announced June 2024.
-
A Survey on Data Quality Dimensions and Tools for Machine Learning
Authors:
Yuhan Zhou,
Fengjiao Tu,
Kewei Sha,
Junhua Ding,
Haihua Chen
Abstract:
Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools in the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on the discussions on the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: https://github.com/haihua0913/awesome-dq4ml.
Submitted 27 June, 2024;
originally announced June 2024.
-
Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition
Authors:
Kai Shao,
Rui Wang,
Yixue Hao,
Long Hu,
Min Chen,
Hans Arno Jacobsen
Abstract:
Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal physiological signals representation learning framework using Siamese architecture via multiscale contrasting for depression recognition (MRLMC). First, fNIRS and EEG are transformed into different but correlated data based on a time-domain data augmentation strategy. Then, we design a spatio-temporal contrasting module to learn the representation of fNIRS and EEG through weight-sharing multiscale spatio-temporal convolution. Furthermore, to enhance the learning of semantic representation associated with stimulation tasks, a semantic consistency contrast module is proposed, aiming to maximize the semantic similarity of fNIRS and EEG. Extensive experiments on publicly available and self-collected multimodal physiological signals datasets indicate that MRLMC outperforms the state-of-the-art models. Moreover, our proposed framework is capable of transferring to multimodal time series downstream tasks.
Submitted 25 June, 2024; v1 submitted 22 June, 2024;
originally announced June 2024.
-
Blow-up of solutions to semilinear wave equations with spatial derivatives
Authors:
Kerun Shao,
Hiroyuki Takamura,
Chengbo Wang
Abstract:
For small-amplitude semilinear wave equations with power type nonlinearity on the first-order spatial derivative, the expected sharp upper bound on the lifespan of solutions is obtained for both critical cases and subcritical cases, for all spatial dimensions $n>1$. It is achieved uniformly by constructing the integral equations, deriving the ordinary differential inequality system, and applying an iteration argument. Combined with the former works, the sharp lifespan estimates for this problem are completely established, at least for the spherically symmetric case.
Submitted 4 June, 2024;
originally announced June 2024.
-
SFANet: Spatial-Frequency Attention Network for Weather Forecasting
Authors:
Jiaze Wang,
Hao Chen,
Hongcan Xu,
Jinpeng Li,
Bowen Wang,
Kun Shao,
Furui Liu,
Huaxi Chen,
Guangyong Chen,
Pheng-Ann Heng
Abstract:
Weather forecasting plays a critical role in various sectors, driving decision-making and risk management. However, traditional methods often struggle to capture the complex dynamics of meteorological systems, particularly in the presence of high-resolution data. In this paper, we propose the Spatial-Frequency Attention Network (SFANet), a novel deep learning framework designed to address these challenges and enhance the accuracy of spatiotemporal weather prediction. Drawing inspiration from the limitations of existing methodologies, we present an innovative approach that seamlessly integrates advanced token mixing and attention mechanisms. By leveraging both pooling and spatial mixing strategies, SFANet optimizes the processing of high-dimensional spatiotemporal sequences, preserving inter-component relational information and modeling extensive long-range relationships. To further enhance feature integration, we introduce a novel spatial-frequency attention module, enabling the model to capture intricate cross-modal correlations. Our extensive experimental evaluation on two distinct datasets, the Storm EVent ImageRy (SEVIR) and the Institute for Climate and Application Research (ICAR) - El Niño Southern Oscillation (ENSO) dataset, demonstrates the remarkable performance of SFANet. Notably, SFANet achieves substantial advancements over state-of-the-art methods, showcasing its proficiency in forecasting precipitation patterns and predicting El Niño events.
Submitted 29 May, 2024;
originally announced May 2024.
-
Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain
Authors:
Juntao Zhang,
Shaogeng Liu,
Kun Bian,
You Zhou,
Pei Zhang,
Wenbo An,
Jun Zhou,
Kun Shao
Abstract:
In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences such as language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model's ability to interpret spatial relationships from a global perspective. We use Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation in both frequency and spatial domains. The introduction of frequency domain information enables ViM to have a global receptive field during scanning. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it accordingly in Vim-F, which helps to fully utilize the efficient long-sequence modeling capability of ViM. Finally, we redesign a patch embedding for Vim-F, leveraging a convolutional stem to capture more local correlations, further improving the performance of Vim-F. Code is available at: https://github.com/yws-wxs/Vim-F.
Submitted 7 January, 2025; v1 submitted 28 May, 2024;
originally announced May 2024.
-
Music Enhancement with Deep Filters: A Technical Report for The ICASSP 2024 Cadenza Challenge
Authors:
Keren Shao,
Ke Chen,
Shlomo Dubnov
Abstract:
In this challenge, we disentangle the deep filters from the original DeepfilterNet and incorporate them into our Spec-UNet-based network to further improve a hybrid Demucs (hdemucs) based remixing pipeline. The motivation behind the use of the deep filter component lies in its potential to better handle temporal fine structures. We demonstrate an incremental improvement in both the Signal-to-Distortion Ratio (SDR) and the Hearing Aid Audio Quality Index (HAAQI) metrics when comparing the performance of hdemucs against different versions of our model.
Submitted 17 April, 2024;
originally announced April 2024.
-
On the asymptotic behavior of solutions to the steady Navier-Stokes system in two-dimensional channels
Authors:
Han Li,
Kaijian Sha
Abstract:
In this paper, we investigate the incompressible steady Navier-Stokes system with no-slip boundary condition in a two-dimensional channel. Given any flux, the existence of solutions is proved as long as the width of cross-section of the channel grows more slowly than the linear growth. Furthermore, if the flux is suitably small, the solution is unique even when the width of the channel is unbounded. Finally, based on the estimate of Dirichlet norm on the truncated domain, one could obtain the pointwise decay rate of the solution for arbitrary flux.
Submitted 26 March, 2024;
originally announced April 2024.
-
Distilling Morphology-Conditioned Hypernetworks for Efficient Universal Morphology Control
Authors:
Zheng Xiong,
Risto Vuorio,
Jacob Beck,
Matthieu Zimmer,
Kun Shao,
Shimon Whiteson
Abstract:
Learning a universal policy across different robot morphologies can significantly improve learning efficiency and enable zero-shot generalization to unseen morphologies. However, learning a highly performant universal policy requires sophisticated architectures like transformers (TF) that have larger memory and computational cost than simpler multi-layer perceptrons (MLP). To achieve both good performance like TF and high efficiency like MLP at inference time, we propose HyperDistill, which consists of: (1) A morphology-conditioned hypernetwork (HN) that generates robot-wise MLP policies, and (2) A policy distillation approach that is essential for successful training. We show that on UNIMAL, a benchmark with hundreds of diverse morphologies, HyperDistill performs as well as a universal TF teacher policy on both training and unseen test robots, but reduces model size by 6-14 times, and computational cost by 67-160 times in different environments. Our analysis attributes the efficiency advantage of HyperDistill at inference time to knowledge decoupling, i.e., the ability to decouple inter-task and intra-task knowledge, a general principle that could also be applied to improve inference efficiency in other domains.
Submitted 3 June, 2024; v1 submitted 9 February, 2024;
originally announced February 2024.
-
Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning
Authors:
Filippos Christianos,
Georgios Papoudakis,
Matthieu Zimmer,
Thomas Coste,
Zhihao Wu,
Jingxuan Chen,
Khyati Khandelwal,
James Doran,
Xidong Feng,
Jiacheng Liu,
Zheng Xiong,
Yicheng Luo,
Jianye Hao,
Kun Shao,
Haitham Bou-Ammar,
Jun Wang
Abstract:
A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL). However, constructing a standalone RL policy that maps perception to action directly encounters severe problems, chief among them being its lack of generality across multiple tasks and the need for a large amount of training data. The leading cause is that it cannot effectively integrate prior information into the perception-action cycle when devising the policy. Large language models (LLMs) emerged as a fundamental way to incorporate cross-domain knowledge into AI agents but lack crucial learning and adaptation toward specific decision problems. This paper presents a general framework model for integrating and learning structured reasoning into AI agents' policies. Our methodology is motivated by the modularity found in the human brain. The framework utilises the construction of intrinsic and extrinsic functions to add previous understandings of reasoning structures. It also provides the adaptive ability to learn models inside every module or function, consistent with the modular structure of cognitive processes. We describe the framework in-depth and compare it with other AI pipelines and existing frameworks. The paper explores practical applications, covering experiments that show the effectiveness of our method. Our results indicate that AI agents perform and adapt far better when organised reasoning and prior knowledge are embedded. This opens the door to more resilient and general AI agent systems.
Submitted 22 December, 2023;
originally announced December 2023.
-
A survey on algorithms for Nash equilibria in finite normal-form games
Authors:
Hanyu Li,
Wenhan Huang,
Zhijian Duan,
David Henry Mguni,
Kun Shao,
Jun Wang,
Xiaotie Deng
Abstract:
Nash equilibrium is one of the most influential solution concepts in game theory. With the development of computer science and artificial intelligence, there is an increasing demand on Nash equilibrium computation, especially for Internet economics and multi-agent learning. This paper reviews various algorithms computing the Nash equilibrium and its approximation solutions in finite normal-form games from both theoretical and empirical perspectives. For the theoretical part, we classify algorithms in the literature and present basic ideas on algorithm design and analysis. For the empirical part, we present a comprehensive comparison on the algorithms in the literature over different kinds of games. Based on these results, we provide practical suggestions on implementations and uses of these algorithms. Finally, we present a series of open problems from both theoretical and practical considerations.
Submitted 18 December, 2023;
originally announced December 2023.
-
Transformer-QEC: Quantum Error Correction Code Decoding with Transferable Transformers
Authors:
Hanrui Wang,
Pengyu Liu,
Kevin Shao,
Dantong Li,
Jiaqi Gu,
David Z. Pan,
Yongshan Ding,
Song Han
Abstract:
Quantum computing has the potential to solve problems that are intractable for classical systems, yet the high error rates in contemporary quantum devices often exceed tolerable limits for useful algorithm execution. Quantum Error Correction (QEC) mitigates this by employing redundancy, distributing quantum information across multiple data qubits and utilizing syndrome qubits to monitor their states for errors. The syndromes are subsequently interpreted by a decoding algorithm to identify and correct errors in the data qubits. This task is complex due to the multiplicity of error sources affecting both data and syndrome qubits as well as syndrome extraction operations. Additionally, identical syndromes can emanate from different error sources, necessitating a decoding algorithm that evaluates syndromes collectively. Although machine learning (ML) decoders such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) have been proposed, they often focus on local syndrome regions and require retraining when adjusting for different code distances. We introduce a transformer-based QEC decoder which employs self-attention to achieve a global receptive field across all input syndromes. It incorporates a mixed loss training approach, combining both local physical error and global parity label losses. Moreover, the transformer architecture's inherent adaptability to variable-length inputs allows for efficient transfer learning, enabling the decoder to adapt to varying code distances without retraining.
Evaluation on six code distances and ten different error configurations demonstrates that our model consistently outperforms non-ML decoders, such as Union Find (UF) and Minimum Weight Perfect Matching (MWPM), and other ML decoders, thereby achieving best logical error rates. Moreover, the transfer learning can save over 10x of training cost.
Submitted 27 November, 2023;
originally announced November 2023.
-
A Non-Hermitian Moiré Valley Filter
Authors:
Kai Shao,
Hao Geng,
Erfu Liu,
Jose L. Lado,
Wei Chen,
D. Y. Xing
Abstract:
A valley filter capable of generating a valley-polarized current is a crucial element in valleytronics, yet its implementation remains challenging. Here, we propose a valley filter made of a graphene bilayer which exhibits a 1D moiré pattern in the overlapping region of the two layers controlled by heterostrain. In the presence of a lattice modulation between layers, electrons propagating in one layer can have valley-dependent dissipation due to valley asymmetric interlayer coupling, thus giving rise to a valley-polarized current. Such a process can be described by an effective non-Hermitian theory, in which the valley filter is driven by a valley-resolved non-Hermitian skin effect. Nearly 100% valley-polarization can be achieved within a wide parameter range and the functionality of the valley filter is electrically tunable. The non-Hermitian topological scenario of the valley filter ensures high tolerance against imperfections such as disorder and edge defects. Our work opens a new route for efficient and robust valley filters while significantly relaxing the stringent implementation requirements.
Submitted 18 April, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Maximal Martingale Wasserstein Inequality
Authors:
Benjamin Jourdain,
Kexin Shao
Abstract:
In this note, we complete the analysis of the Martingale Wasserstein Inequality started in arXiv:2011.11599 by checking that this inequality fails in dimension $d\ge 2$ when the integrability parameter $ρ$ belongs to $[1,2)$ while a stronger Maximal Martingale Wasserstein Inequality holds whatever the dimension $d$ when $ρ\ge 2$.
Submitted 12 October, 2023;
originally announced October 2023.
-
Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction
Authors:
Keren Shao,
Ke Chen,
Taylor Berg-Kirkpatrick,
Shlomo Dubnov
Abstract:
In deep learning research, many melody extraction models rely on redesigning neural network architectures to improve performance. In this paper, we propose an input feature modification and a training objective modification based on two assumptions. First, harmonics in the spectrograms of audio data decay rapidly along the frequency axis. To enhance the model's sensitivity on the trailing harmonics, we modify the Combined Frequency and Periodicity (CFP) representation using discrete z-transform. Second, the vocal and non-vocal segments with extremely short duration are uncommon. To ensure a more stable melody contour, we design a differentiable loss function that prevents the model from predicting such segments. We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network. Our experimental results demonstrate that the proposed modifications are empirically effective for singing melody extraction.
Submitted 4 August, 2023;
originally announced August 2023.
-
ChessGPT: Bridging Policy Learning and Language Modeling
Authors:
Xidong Feng,
Yicheng Luo,
Ziyan Wang,
Hongrui Tang,
Mengyue Yang,
Kun Shao,
David Mguni,
Yali Du,
Jun Wang
Abstract:
When solving decision-making tasks, humans typically depend on information from two key sources: (1) Historical policy data, which provides interaction replay from the environment, and (2) Analytical insights in natural language form, exposing the invaluable thought process or strategic considerations. Despite this, the majority of preceding research focuses on only one source: they either use historical replay exclusively to directly learn policy or value functions, or engage in language model training utilizing a mere language corpus. In this paper, we argue that a powerful autonomous agent should cover both sources. Thus, we propose ChessGPT, a GPT model bridging policy learning and language modeling by integrating data from these two sources in Chess games. Specifically, we build a large-scale game and language dataset related to chess. Leveraging the dataset, we showcase two model examples, ChessCLIP and ChessGPT, integrating policy learning and language modeling. Finally, we propose a full evaluation framework for evaluating a language model's chess ability. Experimental results validate our model and dataset's effectiveness. We open source our code, model, and dataset at https://github.com/waterhorse1/ChessGPT.
Submitted 21 December, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Uniqueness and uniform structural stability of Poiseuille flows with large fluxes in two-dimensional strips
Authors:
Kaijian Sha,
Yun Wang,
Chunjing Xie
Abstract:
In this paper, we prove the uniform nonlinear structural stability of Poiseuille flows with suitably large flux for the steady Navier-Stokes system in a two-dimensional strip with arbitrary period. Furthermore, the well-posedness theory for the Navier-Stokes system is also proved even when the $L^2$-norm of the external force is large. In particular, if the vertical velocity is suitably small where the smallness is independent of the flux, then Poiseuille flow is the unique solution of the steady Navier-Stokes system in the periodic strip. The key point is to establish uniform a priori estimates for the corresponding linearized problem via the boundary layer analysis, where we explore the particular features of odd and even stream functions. The analysis for the even stream function is new, which not only generalizes the previous study for the symmetric flows in \cite{Rabier1}, but also provides an explicit relation between the flux and period.
Submitted 24 May, 2023;
originally announced May 2023.
-
Non-decreasing martingale couplings
Authors:
Benjamin Jourdain,
Kexin Shao
Abstract:
For many examples of couples $(μ,ν)$ of probability measures on the real line in the convex order, we observe numerically that the Hobson and Neuberger martingale coupling, which maximizes for $ρ=1$ the integral of $|y-x|^ρ$ with respect to any martingale coupling between $μ$ and $ν$, is still a maximizer for $ρ\in(0,2)$ and a minimizer for $ρ>2$. We investigate the theoretical validity of this numerical observation and give rather restrictive sufficient conditions for the property to hold. We also exhibit couples $(μ,ν)$ such that it does not hold. The support of the Hobson and Neuberger coupling is known to satisfy some monotonicity property which we call non-decreasing. We check that the non-decreasing property is preserved for maximizers when $ρ\in(0,1]$. In general, there exist distinct non-decreasing martingale couplings, and we find some decomposition of $ν$ which is in one-to-one correspondence with martingale couplings non-decreasing in a generalized sense.
Submitted 30 April, 2023;
originally announced May 2023.
-
DropDim: A Regularization Method for Transformer Networks
Authors:
Hao Zhang,
Dan Qu,
Keji Shao,
Xukui Yang
Abstract:
We introduce DropDim, a structured dropout method designed for regularizing the self-attention mechanism, which is a key component of the transformer. In contrast to the general dropout method, which randomly drops neurons, DropDim drops part of the embedding dimensions. In this way, the semantic information can be completely discarded. Thus, the excessive coadapting between different embedding dimensions can be broken, and the self-attention is forced to encode meaningful features with a certain number of embedding dimensions erased. Experiments on a wide range of tasks executed on the MUST-C English-German dataset show that DropDim can effectively improve model performance, reduce over-fitting, and show complementary effects with other regularization methods. When combined with label smoothing, the WER can be reduced from 19.1% to 15.1% on the ASR task, and the BLEU value can be increased from 26.90 to 28.38 on the MT task. On the ST task, the model can reach a BLEU score of 22.99, an increase of 1.86 BLEU points compared to the strong baseline.
Submitted 20 April, 2023;
originally announced April 2023.
-
Traj-MAE: Masked Autoencoders for Trajectory Prediction
Authors:
Hao Chen,
Jiaze Wang,
Kun Shao,
Furui Liu,
Jianye Hao,
Chenyong Guan,
Guangyong Chen,
Pheng-Ann Heng
Abstract:
Trajectory prediction has been a crucial task in building a reliable autonomous driving system by anticipating possible dangers. One key issue is to generate consistent trajectory predictions without colliding. To overcome the challenge, we propose an efficient masked autoencoder for trajectory prediction (Traj-MAE) that better represents the complicated behaviors of agents in the driving environment. Specifically, our Traj-MAE employs diverse masking strategies to pre-train the trajectory encoder and map encoder, allowing for the capture of social and temporal information among agents while leveraging the effect of environment from multiple granularities. To address the catastrophic forgetting problem that arises when pre-training the network with multiple masking strategies, we introduce a continual pre-training framework, which can help Traj-MAE learn valuable and diverse information from various strategies efficiently. Our experimental results in both multi-agent and single-agent settings demonstrate that Traj-MAE achieves competitive results with state-of-the-art methods and significantly outperforms our baseline model.
Submitted 12 March, 2023;
originally announced March 2023.