Search | arXiv e-print repository

Structural-Temporal Coupling Anomaly Detection with Dynamic Graph Transformer

Authors: Chang Zong, Yueting Zhuang, Jian Shao, Weiming Lu

Abstract: Detecting anomalous edges in dynamic graphs is an important task in many applications over evolving triple-based data, such as social networks, transaction management, and epidemiology. A major challenge with this task is the absence of structural-temporal coupling information, which decreases the ability of the representation to distinguish anomalies from normal instances. Existing methods focus on handling independent structural and temporal features with embedding models, which ignore the deep interaction between these two types of information. In this paper, we propose a structural-temporal coupling anomaly detection architecture with a dynamic graph transformer model. Specifically, we introduce structural and temporal features from two integration levels to provide anomaly-aware graph evolutionary patterns. Then, a dynamic graph transformer enhanced by two-dimensional positional encoding is implemented to capture both discrimination and contextual consistency signals. Extensive experiments on six datasets demonstrate that our method outperforms current state-of-the-art models. Finally, a case study illustrates the strength of our method when applied to a real-world task.

Submitted 13 May, 2025; originally announced May 2025.

Comments: 20 pages, 6 figures

MSC Class: 68T07; 68T09

arXiv:2505.07782 [pdf, ps, other]

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

Authors: Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai

Abstract: We introduce MLE-Dojo, a Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2505.07110 [pdf]

DeepSORT-Driven Visual Tracking Approach for Gesture Recognition in Interactive Systems

Authors: Tong Zhang, Fenghua Shao, Runsheng Zhang, Yifan Zhuang, Liuqingqing Yang

Abstract: Based on the DeepSORT algorithm, this study explores the application of visual tracking technology in intelligent human-computer interaction, especially in the field of gesture recognition and tracking. With the rapid development of artificial intelligence and deep learning technology, visual-based interaction has gradually replaced traditional input devices and become an important way for intelligent systems to interact with users. The DeepSORT algorithm can achieve accurate target tracking in dynamic environments by combining Kalman filters and deep learning feature extraction methods. It is especially suitable for complex scenes with multi-target tracking and fast movements. This study experimentally verifies the superior performance of DeepSORT in gesture recognition and tracking. It can accurately capture and track the user's gesture trajectory and is superior to traditional tracking methods in terms of real-time and accuracy. In addition, this study also combines gesture recognition experiments to evaluate the recognition ability and feedback response of the DeepSORT algorithm under different gestures (such as sliding, clicking, and zooming). The experimental results show that DeepSORT can not only effectively deal with target occlusion and motion blur but also can stably track in a multi-target environment, achieving a smooth user interaction experience. Finally, this paper looks forward to the future development direction of intelligent human-computer interaction systems based on visual tracking and proposes future research focuses such as algorithm optimization, data fusion, and multimodal interaction in order to promote a more intelligent and personalized interactive experience. Keywords: DeepSORT, visual tracking, gesture recognition, human-computer interaction

Submitted 11 May, 2025; originally announced May 2025.

arXiv:2505.04548 [pdf, other]

Accelerating Audio Research with Robotic Dummy Heads

Authors: Austin Lu, Kanad Sarkar, Yongjie Zhuang, Leo Lin, Ryan M Corey, Andrew C Singer

Abstract: This work introduces a robotic dummy head that fuses the acoustic realism of conventional audiological mannequins with the mobility of robots. The proposed device is capable of moving, talking, and listening as people do, and can be used to automate spatially-stationary audio experiments, thus accelerating the pace of audio research. Critically, the device may also be used as a moving sound source in dynamic experiments, due to its quiet motor. This feature differentiates our work from previous robotic acoustic research platforms. Validation that the robot enables high quality audio data collection is provided through various experiments and acoustic measurements. These experiments also demonstrate how the robot might be used to study adaptive binaural beamforming. Design files are provided as open-source to stimulate novel audio research.

Submitted 7 May, 2025; originally announced May 2025.

Comments: WASPAA 2025

arXiv:2505.03475 [pdf, other]

am-ELO: A Stable Framework for Arena-based LLM Evaluation

Authors: Zirui Liu, Jiatong Li, Yan Zhuang, Qi Liu, Shuanghong Shen, Jie Ouyang, Mingyue Cheng, Shijin Wang

Abstract: Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). Existing framework based on ELO rating system suffers from the inevitable instability problem due to ranking inconsistency and the lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework to address these issues by enhancing the ELO Rating System. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we proposed the am-ELO, which modify the Elo Rating's probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that this framework offers a more robust, accurate, and stable evaluation method for LLMs.

Submitted 29 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

Comments: Accepted at ICML 2025

arXiv:2504.21539 [pdf, other]

Search for the lepton number violation decay $ω\to π^+ π^+ e^-e^- +c.c.$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (698 additional authors not shown)

Abstract: The lepton number violation decay $ω\to π^+ π^+ e^-e^- +c.c.$ is searched for via $J/ψ\to ωη$ using a data sample of $(1.0087 \pm 0.0044) \times 10^{10}$ $J/ψ$ events collected by the BESIII detector at the BEPCII collider. No significant signal is observed, and the upper limit on the branching fraction of $ω\to π^+ π^+ e^-e^- +c.c.$ at the 90\% confidence level is determined for the first time to be $2.8 \times 10^{-6}$.

Submitted 30 April, 2025; originally announced April 2025.

Comments: 9 pages, 3 figures

arXiv:2504.19213 [pdf, other]

Measurements of branching fractions of $D^0\to K^- 3π^+2π^-$, $D^0\to K^- 2π^+π^-2π^0$ and $D^+\to K^- 3π^+π^-π^0$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (693 additional authors not shown)

Abstract: Utilizing $7.9\,\rm fb^{-1}$ of $e^+e^-$ collision data taken with the BESIII detector at the center-of-mass energy of 3.773 GeV, we report the measurements of absolute branching fractions of the hadronic decays $D^0\to K^- 3π^+2π^-$, $D^0\to K^- 2π^+π^-2π^0$ and $D^+\to K^- 3π^+π^-π^0$. The $D^0\to K^- 3π^+2π^-$ decay is measured with improved precision, while the latter two decays are observed with statistical significance higher than $5σ$ for the first time. The absolute branching fractions of these decays are determined to be ${\mathcal B}(D^0\to K^- 3π^+2π^-)=( 1.35\pm 0.23\pm 0.08 )\times 10^{-4}$, ${\mathcal B}(D^0\to K^- 2π^+π^-2π^0)=( 19.0\pm 1.1\pm 1.5)\times 10^{-4}$, and ${\mathcal B}(D^+\to K^- 3π^+π^-π^0)=( 6.57\pm 0.69\pm 0.33)\times 10^{-4}$, where the first uncertainties are statistical and the second systematic.

Submitted 27 April, 2025; originally announced April 2025.

Comments: 12 pages, 6 figures, 4 tables

Report number: BAM-00843

arXiv:2504.19087 [pdf, ps, other]

Search for $η_{1}(1855)$ in $χ_{cJ}\toηηη^{\prime}$ decays

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (697 additional authors not shown)

Abstract: Based on a sample of $2.7\times10^{9}$ $ψ(3686)$ events collected by the BESIII detector operating at the BEPCII collider, an analysis of the decay $ψ(3686)\toγχ_{cJ}, χ_{cJ}\toηηη^{\prime}$ is performed. The decay modes $χ_{c1}$ and $χ_{c2}\toηηη^{\prime}$ are observed for the first time, and their corresponding branching fractions are determined to be $\mathcal{B}(χ_{c1}\toηηη^{\prime}) = (1.40\, \pm 0.13\, (\text{stat.}) \pm 0.09\, (\text{sys.})) \times 10^{-4}$ and $\mathcal{B}(χ_{c2}\toηηη^{\prime}) = (4.18\, \pm 0.84\, (\text{stat.}) \pm 0.48\, (\text{sys.})) \times 10^{-5}$. An upper limit on the branching fraction of $χ_{c0}\toηηη^{\prime}$ is set as $2.59 \times 10^{-5}$ at 90\% confidence level (CL). A partial wave analysis (PWA) of the decay $χ_{c1}\toηηη^{\prime}$ is performed to search for the $1^{-+}$ exotic state $η_1(1855)$. The PWA result indicates that the structure in the $ηη^{\prime}$ mass spectrum is mainly attributed to the $f_0(1500)$, while in the $ηη$ mass spectrum, it is primarily the $0^{++}$ phase space. The upper limit of $\mathcal{B}(χ_{c1}\toη_{1}(1855)η) \cdot \mathcal{B}(η_{1}(1855)\toηη^{\prime})< 9.79 \times 10^{-5}$ is set based on the PWA at 90\% CL.

Submitted 3 June, 2025; v1 submitted 26 April, 2025; originally announced April 2025.

arXiv:2504.13865 [pdf, ps, other]

A Survey on (M)LLM-Based GUI Agents

Authors: Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang

Abstract: Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents' capabilities. Our systematic review provides researchers and practitioners with a thorough understanding of the field's current state and offers insights into future developments in intelligent interface automation.

Submitted 4 June, 2025; v1 submitted 27 March, 2025; originally announced April 2025.

arXiv:2504.13771 [pdf, other]

Search for $J/ψ\rightarrow K^{0}_{S}K^{0}_{S}$ and $ψ(3686)\rightarrow K^{0}_{S}K^{0}_{S}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (680 additional authors not shown)

Abstract: Using data samples of $(10087\pm 44)\times10^{6}$ $J/ψ$ events and $(2712.4\pm 14.3)\times10^{6}$ $ψ(3686)$ events collected with the BESIII detector at the BEPCII collider, we search for the CP violating decays $J/ψ\rightarrow K^{0}_{S}K^{0}_{S}$ and $ψ(3686)\rightarrow K^{0}_{S}K^{0}_{S}$. No significant signals are observed over the expected background yields. The upper limits on their branching fractions are set as $\mathcal{B}(J/ψ\rightarrow K^{0}_{S}K^{0}_{S}) <4.7\times 10^{-9}$ and $\mathcal{B}(ψ(3686)\rightarrow K^{0}_{S}K^{0}_{S}) <1.1\times 10^{-8}$ at the 90% confidence level. These results improve the previous limits by a factor of three for $J/ψ\rightarrow K^{0}_{S} K^{0}_{S}$ and two orders of magnitude for $ψ(3686)\rightarrow K^{0}_{S} K^{0}_{S}$.

Submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.13650 [pdf, other]

EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model

Authors: Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, Beng Chin Ooi

Abstract: Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instruction data; (ii) Benchmark. The absence of a comprehensive and systematic benchmark for evaluating diagnostic performance; (iii) Model. The difficulty of adapting holistic visual architectures to fine-grained, region-specific ophthalmic lesion identification. In this paper, we propose the Eyecare Kit, which systematically tackles the aforementioned three key challenges with the tailored dataset, benchmark and model: First, we construct a multi-agent data engine with real-life ophthalmology data to produce Eyecare-100K, a high-quality ophthalmic visual instruction dataset. Subsequently, we design Eyecare-Bench, a benchmark that comprehensively evaluates the overall performance of LVLMs on intelligent ophthalmic diagnosis tasks across multiple dimensions. Finally, we develop the EyecareGPT, optimized for fine-grained ophthalmic visual understanding thoroughly, which incorporates an adaptive resolution mechanism and a layer-wise dense connector. Extensive experimental results indicate that the EyecareGPT achieves state-of-the-art performance in a range of ophthalmic tasks, underscoring its significant potential for the advancement of open research in intelligent ophthalmic diagnosis. Our project is available at https://github.com/DCDmllm/EyecareGPT.

Submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.13539 [pdf, other]

Search for $1^{-+}$ charmonium-like hybrid via $e^{+}e^{-}\rightarrow γη^{(\prime)} η_{c}$ at center-of-mass energies between 4.258 and 4.681 GeV

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (696 additional authors not shown)

Abstract: Using $e^{+}e^{-}$ collision data corresponding to an integrated luminosity of 10.6 fb$^{-1}$ collected at center-of-mass energies between 4.258 and 4.681 GeV with the BESIII detector at the BEPCII collider, we search for the $1^{- +}$ charmonium-like hybrid via $e^{+}e^{-}\rightarrowγηη_{c}$ and $e^{+}e^{-}\rightarrowγη^{\prime}η_{c}$ decays for the first time. No significant signal is observed and the upper limits on the Born cross sections for both processes are set at the 90% confidence level.

Submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.12795 [pdf, other]

EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery

Authors: Wei Zhang, Miaoxin Cai, Yaqian Ning, Tong Zhang, Yin Zhuang, He Chen, Jun Li, Xuerui Mao

Abstract: Recent advances in the visual-language area have developed natural multi-modal large language models (MLLMs) for spatial reasoning through visual prompting. However, due to remote sensing (RS) imagery containing abundant geospatial information that differs from natural images, it is challenging to effectively adapt natural spatial models to the RS domain. Moreover, current RS MLLMs are limited in overly narrow interpretation levels and interaction manner, hindering their applicability in real-world scenarios. To address those challenges, a spatial MLLM named EarthGPT-X is proposed, enabling a comprehensive understanding of multi-source RS imagery, such as optical, synthetic aperture radar (SAR), and infrared. EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities. Moreover, EarthGPT-X unifies two types of critical spatial tasks (i.e., referring and grounding) into a visual prompting framework. To achieve these versatile capabilities, several key strategies are developed. The first is the multi-modal content integration method, which enhances the interplay between images, visual prompts, and text instructions. Subsequently, a cross-domain one-stage fusion training strategy is proposed, utilizing the large language model (LLM) as a unified interface for multi-source multi-task learning. Furthermore, by incorporating a pixel perception module, the referring and grounding tasks are seamlessly unified within a single framework. In addition, the experiments conducted demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks and its impressive flexibility in multi-modal interaction, revealing significant advancements of MLLM in the RS field.

Submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.12100 [pdf, other]

Generalized Visual Relation Detection with Diffusion Models

Authors: Kaifeng Gao, Siqi Chen, Hanwang Zhang, Jun Xiao, Yueting Zhuang, Qianru Sun

Abstract: Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., ``ride'' can be depicted as ``race'' and ``sit on'', from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.

Submitted 16 April, 2025; originally announced April 2025.

Comments: Under review at IEEE TCSVT. The Appendix is provided additionally

arXiv:2504.11301 [pdf, other]

Learning to Be A Doctor: Searching for Effective Medical Agent Architectures

Authors: Yangyang Zhuang, Wenjia Jiang, Jiayu Zhang, Ze Yang, Joey Tianyi Zhou, Chi Zhang

Abstract: Large Language Model (LLM)-based agents have demonstrated strong capabilities across a wide range of tasks, and their application in the medical domain holds particular promise due to the demand for high generalizability and reliance on interdisciplinary knowledge. However, existing medical agent systems often rely on static, manually crafted workflows that lack the flexibility to accommodate diverse diagnostic requirements and adapt to emerging clinical scenarios. Motivated by the success of automated machine learning (AutoML), this paper introduces a novel framework for the automated design of medical agent architectures. Specifically, we define a hierarchical and expressive agent search space that enables dynamic workflow adaptation through structured modifications at the node, structural, and framework levels. Our framework conceptualizes medical agents as graph-based architectures composed of diverse, functional node types and supports iterative self-improvement guided by diagnostic feedback. Experimental results on skin disease diagnosis tasks demonstrate that the proposed method effectively evolves workflow structures and significantly enhances diagnostic accuracy over time. This work represents the first fully automated framework for medical agent architecture design and offers a scalable, adaptable foundation for deploying intelligent agents in real-world clinical environments.

Submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.10867 [pdf, other]

Precise measurement of the form factors in $D^0\rightarrow K^*(892)^-μ^+ν_μ$ and test of lepton universality with $D^0\rightarrow K^*(892)^-\ell^+ν_{\ell}$ decays

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (696 additional authors not shown)

Abstract: We report a study of the semileptonic decay $D^0 \rightarrow \bar{K}^0π^-μ^+ν_μ$ based on a sample of $7.9~\mathrm{fb}^{-1}$ of $e^+e^-$ annihilation data collected at a center-of-mass energy of 3.773~GeV with the BESIII detector at the BEPCII collider. The branching fraction of the decay is measured for the first time to be $\mathcal{B}(D^0\rightarrow \bar{K}^0π^-μ^+ν_μ) = (1.373 \pm 0.020_{\rm stat} \pm 0.023_{\rm syst})\%$, where the first uncertainty is statistical and the second is systematic. Based on the investigation of the decay dynamics, we find that the decay is dominated by the $K^{*}(892)^-$ resonance with the branching fraction measured to be $\mathcal{B}(D^0\rightarrow K^{*}(892)^-μ^+ν_μ) = (1.948 \pm 0.033_{\rm stat} \pm 0.036_{\rm syst})\%$. We also determine the hadronic form factors for the $D^0\rightarrow K^{*}(892)^-μ^+ν_μ$ decay to be $r_{V} = V(0)/A_1(0) = 1.46 \pm 0.11_{\rm stat} \pm 0.04_{\rm syst}$, $r_{2} = A_2(0)/A_1(0) = 0.71 \pm 0.08_{\rm stat} \pm 0.03_{\rm syst}$, and $A_1(0)=0.609 \pm 0.008_{\rm stat} \pm 0.008_{\rm syst}$, where $V(0)$ is the vector form factor and $A_{1,2}(0)$ are the axial form factors evaluated at $q^2=0$. The $A_1(0)$ is measured for the first time in $D^0\rightarrow K^{*}(892)^-μ^+ν_μ$ decay. Averaging the form-factor parameters that we reported previously in $D^0\rightarrow K^*(892)^-(\rightarrow \bar{K}^0π^-)e^+ν_{e}$ and $D^0\rightarrow K^*(892)^-(\rightarrow K^-π^0)μ^+ν_μ$ decays, we obtain $r_{V}=1.456\pm0.040_{\rm stat}\pm0.016_{\rm syst}$, $r_{2}=0.715\pm0.031_{\rm stat}\pm0.014_{\rm syst}$, and $A_1(0)=0.614\pm0.005_{\rm stat}\pm0.004_{\rm syst}$. This is the most precise determination of the form-factor parameters to date measured in $D\rightarrow K^*(892)$ transition, which provide the most stringent test on various theoretical models.

Submitted 15 April, 2025; originally announced April 2025.

Comments: 9 pages, 4 figures

arXiv:2504.08703 [pdf, other]

SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents

Authors: Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, Laurent Callot

Abstract: Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains challenging. We introduce SWE-PolyBench, a new multi-language benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729) and Python (199), covering bug fixes, feature additions, and code refactoring. We provide a task and repository-stratified subsample (SWE-PolyBench500) and release an evaluation harness allowing for fully automated evaluation. To enable a more comprehensive comparison of coding agents, this work also presents a novel set of metrics rooted in syntax tree analysis. We evaluate leading open source coding agents on SWE-PolyBench, revealing their strengths and limitations across languages, task types, and complexity classes. Our experiments show that current agents exhibit uneven performance across languages and struggle with complex problems while showing higher performance on simpler tasks. SWE-PolyBench aims to drive progress in developing more versatile and robust AI coding assistants for real-world software engineering. Our datasets and code are available at: https://github.com/amazon-science/SWE-PolyBench

Submitted 23 April, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

Comments: 20 pages, 6 figures, corrected author name spelling

arXiv:2504.07817 [pdf, other]

Search for the baryon and lepton number violating decay $J/ψ\to pe^-$ + c.c

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (664 additional authors not shown)

Abstract: Based on $(2712.4\pm 14.3) \times 10^{6}$ ${ψ(3686)}$ events collected by the BESIII detector operating at the BEPCII storage ring, we perform a search for the baryon- and lepton-number violating decay $J/ψ\to pe^{-}+c.c.$ via $ψ(3686) \to π^{+}π^{-}J/ψ$. No significant signal is found. An upper limit on the branching fraction is set at $\mathcal{B}(J/ψ\to p e^{-}+ c.c.) < 3.1 \times 10^{-8}$ at the 90\% confidence level.

Submitted 10 April, 2025; originally announced April 2025.

Comments: 8 pages, 1 figure

arXiv:2504.07729 [pdf, other]

Benchmarking Multi-Organ Segmentation Tools for Multi-Parametric T1-weighted Abdominal MRI

Authors: Nicole Tran, Anisa Prasad, Yan Zhuang, Tejas Sudharshan Mathai, Boah Kim, Sydney Lewis, Pritam Mukherjee, Jianfei Liu, Ronald M. Summers

Abstract: The segmentation of multiple organs in multi-parametric MRI studies is critical for many applications in radiology, such as correlating imaging biomarkers with disease status (e.g., cirrhosis, diabetes). Recently, three publicly available tools, namely MRSegmentator (MRSeg), TotalSegmentator MRI (TS), and TotalVibeSegmentator (VIBE), have been proposed for multi-organ segmentation in MRI. However, the performance of these tools on specific MRI sequence types has not yet been quantified. In this work, a subset of 40 volumes from the public Duke Liver Dataset was curated. The curated dataset contained 10 volumes each from the pre-contrast fat saturated T1, arterial T1w, venous T1w, and delayed T1w phases. Ten abdominal structures were manually annotated in these volumes. Next, the performance of the three public tools was benchmarked on this curated dataset. The results indicated that MRSeg obtained a Dice score of 80.7 $\pm$ 18.6 and Hausdorff Distance (HD) error of 8.9 $\pm$ 10.4 mm. It fared the best ($p < .05$) across the different sequence types in contrast to TS and VIBE.

Submitted 10 April, 2025; originally announced April 2025.

Comments: Published at SPIE Medical Imaging 2025

arXiv:2504.06606 [pdf, other]

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Authors: Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, Yueting Zhuang

Abstract: Recent advancements in reward signal usage for Large Language Models (LLMs) are remarkable. However, significant challenges exist when transitioning reward signals to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Thought~(CoT) reward model automatically. It generates code for solving visual tasks and transforms the analysis of code blocks into the evaluation of CoT steps as training samples. Then, we train the SVIP-Reward model using a multi-head attention mechanism called TriAtt-CoT. The advantages of SVIP-Reward are evident throughout the entire process of MLLM. We also introduce a benchmark for CoT reward model training and testing. Experimental results demonstrate that SVIP-Reward improves MLLM performance across training and inference-time scaling, yielding better results on benchmarks while reducing hallucinations and enhancing reasoning ability.

Submitted 9 April, 2025; originally announced April 2025.

arXiv:2504.05584 [pdf, other]

Observation of Transverse Polarization and Determination of Electromagnetic Form Factor of $Λ$ Hyperon at $\sqrt{s}= 3.773$ GeV

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (697 additional authors not shown)

Abstract: Using a 20.3 fb$^{-1}$ $e^{+}e^{-}$ collision data sample collected by the BESIII detector at the BEPCII collider, we present an observation of transverse polarization and a complete determination of the electromagnetic form factor of the $Λ$ hyperon in $e^{+}e^{-}\toΛ\barΛ$ decay with the entangled $Λ-\barΛ$ pair at $\sqrt{s}=3.773$ GeV. The relative phase between the electric and magnetic form factors is determined to be $ΔΦ=(1.53\pm0.36\pm0.03)$ rad with a significance of 5.5$σ$, taking into account systematic uncertainty. This result indicates a non-zero phase between the transition amplitudes of the $Λ\barΛ$ helicity states. Additionally, the angular distribution parameter and the modulus of the ratio between the electric and magnetic form factors are measured to be $η=0.86\pm0.05\pm0.03$ and $R(s)=|G_{E}(s)/G_{M}(s)|=0.47\pm0.08\pm0.05$, respectively, where the first uncertainty is statistical and the second systematic.

Submitted 7 April, 2025; originally announced April 2025.

Comments: 9 pages, 1 table, 5 figures

arXiv:2504.04915 [pdf, other]

Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration

Authors: Ran Xu, Wenqi Shi, Yuchen Zhuang, Yue Yu, Joyce C. Ho, Haoyu Wang, Carl Yang

Abstract: Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. We observe that Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available on https://github.com/ritaranx/Collab-RAG/.

Submitted 7 April, 2025; originally announced April 2025.

Comments: Work in progress. Code: https://github.com/ritaranx/Collab-RAG/

arXiv:2504.04420 [pdf, ps, other]

Observation of $ψ(3686) \to Ξ^- K^0_S \barΩ^+ $+c.c

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (680 additional authors not shown)

Abstract: Using a sample of $(2.712\pm0.014) \times 10^{9}$ $ψ(3686)$ events collected with the BESIII detector at the electron positron collider BEPCII, the decay $ψ(3686) \to Ξ^- K^0_S \barΩ^+ +c.c.$ is observed for the first time, with a significance of 5.9 standard deviations. The branching fraction of this decay is measured to be $(2.91\pm0.47\pm0.33)\times 10^{-6}$, where the first and second uncertainties are statistical and systematic, respectively. The ratio between $\mathcal{B}_{ψ(3686) \to Ξ^- K^0_S \barΩ^+ +c.c.}$ and $\mathcal{B}_{ψ(3686) \to Ω^- K^+ \barΞ^0 +c.c.}$ is determined to be $1.05\pm0.23\pm0.14$, which deviates from the value of 0.5 predicted by isospin symmetry conservation by $2.1σ$.

Submitted 13 June, 2025; v1 submitted 6 April, 2025; originally announced April 2025.

arXiv:2504.04096 [pdf, ps, other]

Observation of a Three-Resonance Structure in the Cross Section of $e^+e^-\toπ^+π^- h_c$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (697 additional authors not shown)

Abstract: Using $e^+e^-$ collision data collected with the BESIII detector operating at the Beijing Electron Positron Collider, the cross section of $e^+e^-\to π^+π^- h_c$ is measured at 59 points with center-of-mass energy $\sqrt{s}$ ranging from $4.009$ to $4.950~\mathrm{GeV}$ with a total integrated luminosity of $22.2~\mathrm{fb}^{-1}$. The cross section between $4.3$ and $4.45~\mathrm{GeV}$ exhibits a plateau-like shape and drops sharply around $4.5~\mathrm{GeV}$, which cannot be described by two resonances only. Three coherent Breit-Wigner functions are used to parameterize the $\sqrt{s}$-dependent cross section line shape. The masses and widths are determined to be $M_1=(4223.6_{-3.7-2.9}^{+3.6+2.6})~\mathrm{MeV}/c^2$, $Γ_1=(58.5_{-11.4-6.5}^{+10.8+6.7})~\mathrm{MeV}$, $M_2=(4327.4_{-18.8-9.3}^{+20.1+10.7})~\mathrm{MeV}/c^2$, $Γ_2=(244.1_{-27.1-18.0}^{+34.0+23.9})~\mathrm{MeV}$, and $M_3=(4467.4_{-5.4-2.7}^{+7.2+3.2})~\mathrm{MeV}/c^2$, $Γ_3=(62.8_{-14.4-6.6}^{+19.2+9.8})~\mathrm{MeV}$. The first uncertainties are statistical and the second are systematic. The statistical significance of the three Breit-Wigner assumption over the two Breit-Wigner assumption is greater than $5σ$.

Submitted 5 April, 2025; originally announced April 2025.

arXiv:2504.01823 [pdf, other]

Evidence of doubly OZI-suppressed decay $η_{c} \to ωφ$ in the radiative decay $J/ψ\to γη_{c}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (680 additional authors not shown)

Abstract: Using a sample of $(10087\pm44) \times 10^{6}$ $J/ψ$ events collected with the BESIII detector at the BEPCII collider, the first evidence for the doubly OZI-suppressed decay $η_{c} \to ωφ$ is reported with a significance of 4.0$σ$. The branching fraction of $η_{c} \to ωφ$ is measured to be $\mathcal{B}(η_{c} \to ωφ) = (3.86 \pm 0.92 \pm 0.62) \times 10^{-5}$, where the first uncertainty is statistical and the second is systematic. This result provides valuable insights into the underlying mechanisms of charmonium decays, particularly for processes such as $η_{c} \to VV$ (where $V$ represents a vector meson).

Submitted 2 April, 2025; originally announced April 2025.

arXiv:2503.23260 [pdf, other]

Mismatch-Robust Underwater Acoustic Localization Using A Differentiable Modular Forward Model

Authors: Dariush Kari, Yongjie Zhuang, Andrew C. Singer

Abstract: In this paper, we study underwater acoustic localization in the presence of environmental mismatch. Specifically, we exploit a pre-trained neural network for acoustic wave propagation in a gradient-based optimization framework to estimate the source location. To alleviate the effect of mismatch between the training data and the test data, we simultaneously optimize over the network weights at inference time, and provide conditions under which this method is effective. Moreover, we introduce a physics-inspired modularity in the forward model that enables us to learn the path lengths of the multipath structure in an end-to-end training manner without access to the specific path labels. We investigate the validity of the assumptions in a simple yet illustrative environment model.

Submitted 29 March, 2025; originally announced March 2025.

arXiv:2503.22126 [pdf, other]

Updated model-independent measurement of the strong-phase differences between $D^0$ and $\bar{D}^0 \to K^{0}_{S/L}π^+π^-$ decays

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (696 additional authors not shown)

Abstract: The strong-phase differences between $D^0\to K_{S/L}^0π^+π^-$ and $\bar{D}^0\to K_{S/L}^0π^+π^-$ decays are one of the most important inputs in measuring the $C\!P$ violating angle $γ$ via $B^- \to D K^-$ decays. They also play a key role in studies of charm mixing and indirect $C\!P$ violation. In this paper, the strong-phase differences are determined in a model-independent way with quantum-correlated $D^0$-$\bar{D}^0$ decays from 7.93 fb$^{-1}$ of $e^+e^-$ annihilation data at $\sqrt{s}$=3.773 GeV by the BESIII experiment. These results are the most precise to date and are expected to significantly reduce associated uncertainties in determining the $C\!P$ violating angle $γ$ and related charm mixing parameters.

Submitted 18 April, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.21696 [pdf, other]

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Authors: Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang

Abstract: Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1 style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9\%, +24\%, and +13\%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Evaluation in real-world environments also shows our model's superiority, with fewer repeated searches and logical inconsistencies.

Submitted 14 May, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

Comments: Code: https://github.com/zwq2018/embodied_reasoner Dataset: https://huggingface.co/datasets/zwq2018/embodied_reasoner

arXiv:2503.19823 [pdf, other]

GyralNet Subnetwork Partitioning via Differentiable Spectral Modularity Optimization

Authors: Yan Zhuang, Minheng Chen, Chao Cao, Tong Chen, Jing Zhang, Xiaowei Yu, Yanjun Lyu, Lu Zhang, Tianming Liu, Dajiang Zhu

Abstract: Understanding the structural and functional organization of the human brain requires a detailed examination of cortical folding patterns, among which the three-hinge gyrus (3HG) has been identified as a key structural landmark. GyralNet, a network representation of cortical folding, models 3HGs as nodes and gyral crests as edges, highlighting their role as critical hubs in cortico-cortical connectivity. However, existing methods for analyzing 3HGs face significant challenges, including the sub-voxel scale of 3HGs at typical neuroimaging resolutions, the computational complexity of establishing cross-subject correspondences, and the oversimplification of treating 3HGs as independent nodes without considering their community-level relationships. To address these limitations, we propose a fully differentiable subnetwork partitioning framework that employs a spectral modularity maximization optimization strategy to modularize the organization of 3HGs within GyralNet. By incorporating topological structural similarity and DTI-derived connectivity patterns as attribute features, our approach provides a biologically meaningful representation of cortical organization. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that our method effectively partitions GyralNet at the individual level while preserving the community-level consistency of 3HGs across subjects, offering a robust foundation for understanding brain connectivity.

Submitted 31 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

Comments: 10 pages, 3 figures

arXiv:2503.18620 [pdf, ps, other]

doi 10.1103/PhysRevD.111.092007

Observation of the decay $ψ(3686)\rightarrow Σ^{0}\barΣ^{0}ω$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (695 additional authors not shown)

Abstract: Using a dataset of $(27.12\pm 0.14)\times 10^{8}$ $ψ(3686)$ events collected by the BESIII detector operating at the BEPCII collider, we report the first observation of the decay $ψ(3686)\toΣ^{0}\barΣ^{0}ω$ with a statistical significance of 8.9$σ$. The measured branching fraction is $(1.24 \pm 0.16_{\textrm{stat}} \pm 0.11_{\textrm{sys}}) \times 10^{-5}$, where the first uncertainty is statistical and the second is systematic. Additionally, we investigate potential intermediate states in the invariant mass distributions of $Σ^{0}ω$, $\barΣ^{0}ω$ and $Σ^{0}\barΣ^{0}$. A hint of a resonance is observed in the invariant mass distribution of $M_{Σ^{0}(\barΣ^{0})ω}$, located around 2.06 GeV/$c^2$, with a significance of 2.5$σ$.

Submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.17165 [pdf, other]

Stringent test of $CP$ symmetry in $Σ^+$ hyperon decays

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (680 additional authors not shown)

Abstract: The non-leptonic two-body weak decays $Σ^{+} \to p π^{0}$ and $\barΣ^{-} \to \bar{p} π^{0}$ are investigated, utilizing $(1.0087\pm0.0044)\times10^{10}$ $J/ψ$ events and $(2.7124\pm0.0143)\times10^{9}$ $ψ(3686)$ events collected by the BESIII experiment. The precision of the weak-decay parameters for the decays $Σ^{+} \to p π^{0}$ ($α_{0}$) and $\barΣ^{-} \to \bar{p} π^{0}$ ($\barα_{0}$) is improved by a factor of three compared to the previous world average. Furthermore, the quantum-entangled $Σ^{+}\barΣ^{-}$ system enables the most precise test of $CP$ symmetry for the decay $Σ^+\to pπ^0$, through the asymmetry observable $A_{CP}=(α_{0}+\barα_{0})/(α_{0}-\barα_{0})$ that is measured to be $-0.0118\pm0.0083_{\rm stat}\pm0.0028_{\rm syst}$. Assuming $CP$ conservation, the average decay parameter is determined to be ${\left< α_{\rm 0}\right>} = (α_0-\barα_0)/2=-0.9869\pm0.0011_{\rm stat}\pm0.0016_{\rm syst}$, which is the most precise measurement of the asymmetry decay parameters in baryon sectors. The angular dependence of the ratio of the polarization of the $Σ^+$ in both $J/ψ$ and $ψ(3686)$ decays is studied for the first time.

Submitted 21 March, 2025; originally announced March 2025.

arXiv:2503.16070 [pdf, other]

Search for the radiative leptonic decay $D^+\toγe^+ν_e$ with Deep Learning

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (680 additional authors not shown)

Abstract: Using 20.3$~\rm fb^{-1}$ of $e^+e^-$ annihilation data collected at a center-of-mass energy of 3.773$~\rm GeV$ with the BESIII detector, we report an improved search for the radiative leptonic decay $D^+\toγe^+ν_e$. An upper limit on its partial branching fraction for photon energies $E_γ>10~\rm MeV$ is determined to be $1.2\times10^{-5}$ at 90\% confidence level, which excludes most current theoretical predictions. A sophisticated deep learning approach with thorough validation, based on the Transformer architecture, is implemented to efficiently distinguish the signal from massive backgrounds.

Submitted 20 March, 2025; originally announced March 2025.

Comments: 15 pages, 6 figures

arXiv:2503.14655 [pdf, other]

Core-Periphery Principle Guided State Space Model for Functional Connectome Classification

Authors: Minheng Chen, Xiaowei Yu, Jing Zhang, Tong Chen, Chao Cao, Yan Zhuang, Yanjun Lyu, Lu Zhang, Tianming Liu, Dajiang Zhu

Abstract: Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges due to their quadratic complexity in long-sequence modeling. To address these limitations, we propose a Core-Periphery State-Space Model (CP-SSM), an innovative framework for functional connectome classification. Specifically, we introduce Mamba, a selective state-space model with linear complexity, to effectively capture long-range dependencies in functional brain networks. Furthermore, inspired by the core-periphery (CP) organization, a fundamental characteristic of brain networks that enhances efficient information transmission, we design CP-MoE, a CP-guided Mixture-of-Experts that improves the representation learning of brain connectivity patterns. We evaluate CP-SSM on two benchmark fMRI datasets: ABIDE and ADNI. Experimental results demonstrate that CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. These findings highlight the effectiveness and efficiency of CP-SSM in modeling brain functional connectivity, offering a promising direction for neuroimaging-based neurological disease diagnosis.

Submitted 18 March, 2025; originally announced March 2025.

arXiv:2503.13330 [pdf, ps, other]

LEAVS: An LLM-based Labeler for Abdominal CT Supervision

Authors: Ricardo Bigolin Lanfredi, Yan Zhuang, Mark Finkelstein, Praveen Thoppey Srinivasan Balamuralikrishna, Luke Krembs, Brandon Khoury, Arthi Reddy, Pritam Mukherjee, Neil M. Rofsky, Ronald M. Summers

Abstract: Extracting structured labels from radiology reports has been employed to create vision models to simultaneously detect several types of abnormalities. However, existing works focus mainly on the chest region. Few works have investigated abdominal radiology reports due to more complex anatomy and a wider range of pathologies in the abdomen. We propose LEAVS (Large language model Extractor for Abdominal Vision Supervision). This labeler can annotate the certainty of presence and the urgency of seven types of abnormalities for nine abdominal organs on CT radiology reports. To ensure broad coverage, we chose abnormalities that encompass most of the finding types from CT reports. Our approach employs a specialized chain-of-thought prompting strategy for a locally-run LLM using sentence extraction and multiple-choice questions in a tree-based decision system. We demonstrate that the LLM can extract several abnormality types across abdominal organs with an average F1 score of 0.89, significantly outperforming competing labelers and humans. Additionally, we show that extraction of urgency labels achieved performance comparable to human annotations. Finally, we demonstrate that the abnormality labels contain valuable information for training a single vision model that classifies several organs as normal or abnormal. We release our code and structured annotations for a public CT dataset containing over 1,000 CT volumes.

Submitted 28 May, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

Comments: Early acceptance (top 9% of submissions) for MICCAI 2025

arXiv:2503.11383 [pdf, other]

Study of $φ\to K\bar{K}$ and $K_{S}^{0}-K_{L}^{0}$ asymmetry in the amplitude analysis of $D_{s}^{+} \to K_{S}^{0}K_{L}^{0}π^{+}$ decay

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (701 additional authors not shown)

Abstract: Using $e^+e^-$ annihilation data corresponding to a total integrated luminosity of 7.33 $\rm fb^{-1}$ collected at center-of-mass energies between 4.128 and 4.226~GeV with the BESIII detector, we provide the first amplitude analysis and absolute branching fraction measurement of the hadronic decay $D_{s}^{+} \to K_{S}^{0}K_{L}^{0}π^{+}$. The branching fraction of $D_{s}^{+} \to K_{S}^{0}K_{L}^{0}π^{+}$ is determined to be $(1.86\pm0.06_{\rm stat}\pm0.03_{\rm syst})\%$. Combining the $\mathcal{B}(D_{s}^{+} \to φ(\to K_{S}^0K_{L}^0) π^+)$ obtained in this work and the world average of $\mathcal{B}(D_{s}^{+} \to φ(\to K^+K^-) π^+)$, we measure the relative branching fraction $\mathcal{B}(φ\to K_S^0K_L^0)/\mathcal{B}(φ\to K^+K^-)$=($0.597 \pm 0.023_{\rm stat} \pm 0.018_{\rm syst} \pm 0.016_{\rm PDG}$), which deviates from the PDG value by more than 3$σ$. Furthermore, the asymmetry of the branching fractions of $D^+_s\to K_{S}^0K^{*}(892)^{+}$ and $D^+_s\to K_{L}^0K^{*}(892)^{+}$, $\frac{\mathcal{B}(D_{s}^{+} \to K_{S}^0K^{*}(892)^{+})-\mathcal{B}(D_{s}^{+} \to K_{L}^0K^{*}(892)^{+})}{\mathcal{B}(D_{s}^{+} \to K_{S}^0K^{*}(892)^{+})+\mathcal{B}(D_{s}^{+} \to K_{L}^0K^{*}(892)^{+})}$, is determined to be $(-13.4\pm5.0_{\rm stat}\pm3.4_{\rm syst})\%$.

Submitted 23 March, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

Comments: 11 pages, 4 figures

arXiv:2503.11001 [pdf, other]

A Weighted Predict-and-Optimize Framework for Power System Operation Considering Varying Impacts of Uncertainty

Authors: Yingrui Zhuang, Lin Cheng, Can Wan, Rui Xie, Ning Qi, Yue Chen

Abstract: Integrating prediction and optimization enhances decision-making quality by yielding near-optimal solutions. Given that prediction errors associated with multiple uncertainties have varying impacts on downstream decision-making, improving the prediction accuracy of critical uncertainties with significant impacts on decision-making quality yields better optimization results. Inspired by this observation, this paper proposes a novel weighted predict-and-optimize (WPO) framework for decision-making under uncertainty. Specifically, we introduce an uncertainty-aware weighting mechanism into the predictive model to capture the relative importance of each uncertainty for specific optimization tasks, and introduce a problem-driven prediction loss (PDPL) to quantify the suboptimality of weighted predictions for downstream optimization as compared to perfect predictions. By optimizing the uncertainty weights to minimize the PDPL, the WPO framework enables adaptive uncertainty impact assessment and integrated learning of prediction and optimization. Furthermore, to facilitate weight optimization, we construct a surrogate model to establish the mapping between weights and PDPL, where multi-task learning and enhanced graph convolutional networks are adopted for efficient surrogate model construction and training. Numerical experiments on modified IEEE 33-bus and 123-bus systems demonstrate that the proposed WPO framework outperforms traditional methods by achieving a much smaller PDPL within acceptable computational time.

Submitted 13 March, 2025; originally announced March 2025.

Comments: This paper has been submitted to IEEE Transactions on Power Systems

arXiv:2503.09640 [pdf, other]

Physics-Aware Human-Object Rendering from Sparse Views via 3D Gaussian Splatting

Authors: Weiquan Wang, Jun Xiao, Yueting Zhuang, Long Chen

Abstract: Rendering realistic human-object interactions (HOIs) from sparse-view inputs is challenging due to occlusions and incomplete observations, yet crucial for various real-world applications. Existing methods struggle with either low rendering quality (e.g., visual fidelity and physically plausible HOIs) or high computational costs. To address these limitations, we propose HOGS (Human-Object Rendering via 3D Gaussian Splatting), a novel framework for efficient and physically plausible HOI rendering from sparse views. Specifically, HOGS combines 3D Gaussian Splatting with a physics-aware optimization process. It incorporates a Human Pose Refinement module for accurate pose estimation and a Sparse-View Human-Object Contact Prediction module for efficient contact region identification. This combination enables coherent joint rendering of human and object Gaussians while enforcing physically plausible interactions. Extensive experiments on the HODome dataset demonstrate that HOGS achieves superior rendering quality, efficiency, and physical plausibility compared to existing methods. We further show its extensibility to hand-object grasp rendering tasks, presenting its broader applicability to articulated object interactions.

Submitted 12 March, 2025; originally announced March 2025.

arXiv:2503.07640 [pdf]

BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification

Authors: Jing Zhang, Xiaowei Yu, Tong Chen, Chao Cao, Mingheng Chen, Yan Zhuang, Yanjun Lyu, Lu Zhang, Li Su, Tianming Liu, Dajiang Zhu

Abstract: Lewy body dementia (LBD) is the second most common neurodegenerative dementia after Alzheimer's disease (AD). Early differentiation between AD and LBD is crucial because they require different treatment approaches, but this is challenging due to significant clinical overlap, heterogeneity, complex pathogenesis, and the rarity of LBD. While recent advances in artificial intelligence (AI) demonstrate powerful learning capabilities and offer new hope for accurate diagnosis, existing methods primarily focus on designing "neural-level networks". Our work represents a pioneering effort in modeling a system-level artificial neural network, called BrainNet-MoE, for brain modeling and diagnosis. Inspired by the brain's hierarchical organization of bottom-up sensory integration and top-down control, we design a set of disease-specific expert groups to process brain sub-networks under different conditions. A disease gate mechanism guides the specialization of expert groups, while a transformer layer enables communication between all sub-networks, generating a comprehensive whole-brain representation for downstream disease classification. Experimental results show superior classification accuracy with interpretable insights into how brain sub-networks contribute to different neurodegenerative conditions.

Submitted 5 March, 2025; originally announced March 2025.

arXiv:2503.06998 [pdf, other]

SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models

Authors: Haoyu Zheng, Qifan Yu, Binghe Yu, Yang Dai, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang

Abstract: Diffusion models have achieved remarkable progress in image and video stylization. However, most existing methods focus on single-style transfer, while video stylization involving multiple styles necessitates seamless transitions between them. We refer to this smooth style transition between video frames as video style morphing. Current approaches often generate stylized video frames with discontinuous structures and abrupt style changes when handling such transitions. To address these limitations, we introduce SOYO, a novel diffusion-based framework for video style morphing. Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency and enable smooth style transitions across video frames. Moreover, we notice that applying linear equidistant interpolation directly induces imbalanced style morphing. To harmonize across video frames, we propose a novel adaptive sampling scheduler operating between two style images. Extensive experiments demonstrate that SOYO outperforms existing methods in open-domain video style morphing, better preserving the structural coherence of video frames while achieving stable and smooth style transitions.

Submitted 10 March, 2025; originally announced March 2025.

arXiv:2503.06692 [pdf, other]

InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models

Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang

Abstract: Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.

Submitted 24 May, 2025; v1 submitted 9 March, 2025; originally announced March 2025.

Comments: Project Page: https://zju-real.github.io/InftyThink Code: https://github.com/ZJU-REAL/InftyThink

arXiv:2503.06470 [pdf, other]

Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems

Authors: Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang

Abstract: Humans can flexibly switch between different modes of thinking based on task complexity: from rapid intuitive judgments to in-depth analytical understanding. However, current Graphical User Interface (GUI) grounding systems which locate interface elements based on natural language instructions rely solely on immediate prediction without reasoning, struggling to understand complex interface layouts with nested structures and hierarchical relationships, limiting their effectiveness on complex interfaces. Inspired by human dual-system cognition, we present Focus, a novel GUI grounding framework that combines fast prediction with systematic analysis. The framework dynamically switches between rapid and deliberate processing through an adaptive system switching based on task complexity, optimizing both efficiency and accuracy. Focus decomposes grounding into progressive stages: interface summarization, visual focused analysis, and precise coordinate prediction. This structured decomposition enables systematic understanding of both interface layouts and visual relationships. Extensive experiments show that Focus achieves state-of-the-art performance using only 300K of the training data with a 2B parameter model compared to existing approaches. Focus demonstrates superior performance particularly in complex GUI scenarios, achieving 77.4% average accuracy on ScreenSpot and 13.3% on the more challenging ScreenSpot-Pro. Our analysis reveals the effectiveness of this dual-system approach while demonstrating its potential for improving complex GUI interaction scenarios.

Submitted 9 March, 2025; originally announced March 2025.

arXiv:2503.05666 [pdf, other]

Kinodynamic Model Predictive Control for Energy Efficient Locomotion of Legged Robots with Parallel Elasticity

Authors: Yulun Zhuang, Yichen Wang, Yanran Ding

Abstract: In this paper, we introduce a kinodynamic model predictive control (MPC) framework that exploits unidirectional parallel springs (UPS) to improve the energy efficiency of dynamic legged robots. The proposed method employs a hierarchical control structure, where the solution of MPC with simplified dynamic models is used to warm-start the kinodynamic MPC, which accounts for nonlinear centroidal dynamics and kinematic constraints. The proposed approach enables energy efficient dynamic hopping on legged robots by using UPS to reduce peak motor torques and energy consumption during stance phases. Simulation results demonstrated a 38.8% reduction in the cost of transport (CoT) for a monoped robot equipped with UPS during high-speed hopping. Additionally, preliminary hardware experiments show a 14.8% reduction in energy consumption. Video: https://youtu.be/AF11qMXJD48

Submitted 7 March, 2025; originally announced March 2025.

Comments: 7 pages, 6 figures. Accepted for publication at ICRA 2025

arXiv:2503.05382 [pdf, other]

Measurement of the branching fractions of $D^+ \to K^+K^-π^+π^+π^-$, $φπ^+π^+π^-$, $K^0_SK^+π^+π^-π^0$, $K^0_SK^+η$, and $K^0_SK^+ω$ decays

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (693 additional authors not shown)

Abstract: Using $20.3~\mathrm{fb}^{-1}$ of $e^+e^-$ collision data collected at a center-of-mass energy of 3.773 GeV with the BESIII detector operating at the BEPCII collider, the branching fractions of three hadronic charm meson decays, $D^+\to φπ^+π^+π^-$, $D^+\to K^0_SK^+π^+π^-π^0$, and $D^+\to K^0_SK^+ω$, are measured for the first time to be $(0.54\pm0.19\pm0.02)\times 10^{-4}$, $(2.51\pm0.34\pm0.14)\times 10^{-4}$, and $(2.02\pm0.35\pm0.10)\times 10^{-4}$, respectively. Furthermore, the branching fractions of $D^+\to K^+K^-π^+π^+π^-$ and $D^+\to K^0_SK^+η$ are measured with improved precision, yielding values of $(0.66\pm0.11\pm0.03)\times 10^{-4}$ and $(2.27\pm0.22\pm0.05)\times 10^{-4}$, respectively.

Submitted 7 March, 2025; originally announced March 2025.

Comments: 11 pages, 3 figures

Report number: BAM-00841

arXiv:2503.04095 [pdf, other]

Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts

Authors: Xiangnan Chen, Yuancheng Fang, Qian Xiao, Juncheng Li, Jun Lin, Siliang Tang, Yi Yang, Yueting Zhuang

Abstract: Multimodal Large Language Models (MLLMs) have garnered significant attention for their strong visual-semantic understanding. Most existing chart benchmarks evaluate MLLMs' ability to parse information from charts to answer questions. However, they overlook the inherent output biases of MLLMs, where models rely on their parametric memory to answer questions rather than genuinely understanding the chart content. To address this limitation, we introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of LLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost. Using HAI, we construct Chart-HQA, a challenging benchmark synthesized from publicly available data sources. Evaluation results on 18 MLLMs of varying model sizes reveal that current models face significant generalization challenges and exhibit imbalanced reasoning performance on the HQA task.

Submitted 7 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

Comments: Under review

arXiv:2503.02268 [pdf, other]

AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

Authors: Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, Chi Zhang

Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of intelligent LLM-based agents capable of interacting with graphical user interfaces (GUIs). These agents demonstrate strong reasoning and adaptability, enabling them to perform complex tasks that traditionally required predefined rules. However, the reliance on step-by-step reasoning in LLM-based agents often results in inefficiencies, particularly for routine tasks. In contrast, traditional rule-based systems excel in efficiency but lack the intelligence and flexibility to adapt to novel scenarios. To address this challenge, we propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent's task execution history. By analyzing this history, the agent identifies repetitive action sequences and evolves high-level actions that act as shortcuts, replacing these low-level operations and improving efficiency. This allows the agent to focus on tasks requiring more complex reasoning, while simplifying routine actions. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy. The code will be open-sourced to support further research.

Submitted 14 April, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

arXiv:2503.02242 [pdf, other]

$\mathbfΦ$-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data

Authors: Xidan Zhang, Yihan Zhuang, Qian Guo, Haodong Yang, Xuelin Qian, Gong Cheng, Junwei Han, Zhongling Huang

Abstract: Approaches for improving generative adversarial networks (GANs) training under a few samples have been explored for natural images. However, these methods have limited effectiveness for synthetic aperture radar (SAR) images, as they do not account for the unique electromagnetic scattering properties of SAR. To remedy this, we propose a physics-inspired regularization method dubbed $Φ$-GAN, which incorporates the ideal point scattering center (PSC) model of SAR with two physical consistency losses. The PSC model approximates SAR targets using physical parameters, ensuring that $Φ$-GAN generates SAR images consistent with real physical properties while preventing discriminator overfitting by focusing on PSC-based decision cues. To embed the PSC model into GANs for end-to-end training, we introduce a physics-inspired neural module capable of estimating the physical parameters of SAR targets efficiently. This module retains the interpretability of the physical model and can be trained with limited data. We propose two physical loss functions: one for the generator, guiding it to produce SAR images with physical parameters consistent with real ones, and one for the discriminator, enhancing its robustness by basing decisions on PSC attributes. We evaluate $Φ$-GAN across several conditional GAN (cGAN) models, demonstrating state-of-the-art performance in data-scarce scenarios on three SAR image datasets.

Submitted 3 March, 2025; originally announced March 2025.

arXiv:2503.02196 [pdf, ps, other]

First Measurement of the Decay Dynamics in the Semileptonic Transition of the $D^{+(0)}$ into the Axial-vector Meson $\bar K_1(1270)$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (680 additional authors not shown)

Abstract: Using $e^+e^-$ collision data taken at the center-of-mass energy of 3.773 GeV with the BESIII detector, corresponding to an integrated luminosity of 20.3 fb$^{-1}$, we report the first amplitude and angular analyses of the semileptonic decays $D^{+(0)}\to K^-π^+π^{0(-)} e^+ν_e$. From the amplitude analysis, we determine for the first time the hadronic form factors of the semileptonic $D$ decays into the axial-vector meson $\bar{K}_1(1270)$ to be $r_A=(-11.2\pm1.0\pm0.9)\times10^{-2}$ and $r_V = (-4.3\pm 1.0\pm2.4)\times 10^{-2}$. The angular analysis yields an up-down asymmetry $\mathcal{A}^\prime_{ud} = 0.01\pm0.11$, which is consistent with the Standard Model prediction.

Submitted 3 March, 2025; originally announced March 2025.

Comments: 15 pages, 6 figures, submitted to PRL

arXiv:2502.21239 [pdf, other]

Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs

Authors: Xiaomin Li, Zhou Yu, Ziji Zhang, Yingying Zhuang, Swair Shah, Narayanan Sadagopan, Anurag Beniwal

Abstract: Large language models (LLMs) have demonstrated remarkable performance across diverse tasks by encoding vast amounts of factual knowledge. However, they are still prone to hallucinations, generating incorrect or misleading information, often accompanied by high uncertainty. Existing methods for hallucination detection primarily focus on quantifying internal uncertainty, which arises from missing or conflicting knowledge within the model. However, hallucinations can also stem from external uncertainty, where ambiguous user queries lead to multiple possible interpretations. In this work, we introduce Semantic Volume, a novel mathematical measure for quantifying both external and internal uncertainty in LLMs. Our approach perturbs queries and responses, embeds them in a semantic space, and computes the determinant of the Gram matrix of the embedding vectors, capturing their dispersion as a measure of uncertainty. Our framework provides a generalizable and unsupervised uncertainty detection method without requiring internal access to LLMs. We conduct extensive experiments on both external and internal uncertainty detection, demonstrating that our Semantic Volume method consistently outperforms existing baselines in both tasks. Additionally, we provide theoretical insights linking our measure to differential entropy, unifying and extending previous sampling-based uncertainty measures such as the semantic entropy. Semantic Volume is shown to be a robust and interpretable approach to improving the reliability of LLMs by systematically detecting uncertainty in both user queries and model responses.
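
Illustrative sketch (not the authors' code): the abstract describes scoring uncertainty by the determinant of the Gram matrix of embedding vectors. The snippet below shows one plausible way to compute such a dispersion score in plain Python. The unit-length normalization, the small ridge term `eps`, the use of a log-determinant, and the function names are all assumptions made for this example; the perturbation and embedding steps are not shown, and the toy vectors stand in for real sentence embeddings.

```python
import math


def gram_matrix(vectors):
    """Return G with G[i][j] = <v_i, v_j> for a list of equal-length vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][i] = math.sqrt(s)
                # det(G) = prod(diag(chol))^2, so add 2 * log of each diagonal entry
                log_det += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return log_det


def semantic_volume(embeddings, eps=1e-6):
    """Hypothetical dispersion score: log det of the (ridge-regularized) Gram matrix.

    Assumptions for this sketch: embeddings are scaled to unit length and
    eps * I is added so nearly parallel vectors do not break the Cholesky step.
    """
    unit = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("zero-length embedding")
        unit.append([x / norm for x in v])
    g = gram_matrix(unit)
    for i in range(len(g)):
        g[i][i] += eps
    return log_det_spd(g)


# Toy stand-ins for embeddings of perturbed responses.
tight = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.99, 0.0, 0.1]]   # near-identical meanings
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # very different meanings

print(semantic_volume(tight) < semantic_volume(spread))
```

In this toy setup the nearly parallel vectors span a much smaller volume than the orthogonal ones, so the more dispersed set receives the higher score, which matches the abstract's reading of dispersion as uncertainty.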

Submitted 5 May, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

arXiv:2502.20821 [pdf, ps, other]

doi 10.1007/JHEP06(2025)194

Improved measurement of absolute branching fraction of the inclusive decay $Λ_{c}^{+} \to K_{S}^{0} X$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (679 additional authors not shown)

Abstract: By analyzing $4.5$ fb$^{-1}$ of $e^{+}e^{-}$ collision data accumulated with the BESIII detector at center-of-mass energies ranging from $4599.53$ MeV to $4698.82$ MeV, we report the measurement of the absolute branching fraction (BF) of the inclusive decay $Λ_{c}^{+} \to K_{S}^{0} X$ using the double-tag technique. The result is $\mathcal{B}(Λ_{c}^{+} \to K_{S}^{0} X)=(10.9\pm0.2\pm0.1)\%$, where the first uncertainty is statistical and the second is systematic. This result indicates that there are still undiscovered decay channels containing $K_{S}^{0}$ in the final state with a combined BF of $(3.1\pm0.4)\%$. The BF of the inclusive decay $Λ_{c}^{+} \to \overline{K}^{0} / K^{0} X$ is calculated to be $\mathcal{B}(Λ_{c}^{+} \to \overline{K}^{0} / K^{0} X)=(21.8 \pm0.4 \pm0.2 \pm1.1)\%$, where the third uncertainty accounts for a possible difference between $\mathcal{B}(Λ_{c}^{+} \to K_{S}^{0} X)$ and $\mathcal{B}(Λ_{c}^{+} \to K_{L}^{0} X)$. The result is in agreement with the prediction of the statistical isospin model.
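
Reader's note on the arithmetic (an inference, not stated in the abstract): the quoted central value and the first two uncertainties of the inclusive $\overline{K}^{0}/K^{0}$ result are consistent with simply doubling the $K_{S}^{0}$ measurement, that is, with assuming $\mathcal{B}(Λ_{c}^{+} \to K_{L}^{0} X) \approx \mathcal{B}(Λ_{c}^{+} \to K_{S}^{0} X)$:

$$\mathcal{B}(Λ_{c}^{+} \to \overline{K}^{0}/K^{0} X) \approx 2\,\mathcal{B}(Λ_{c}^{+} \to K_{S}^{0} X) = 2\times(10.9\pm0.2\pm0.1)\% = (21.8\pm0.4\pm0.2)\%$$

The third uncertainty ($\pm1.1\%$) cannot be reproduced from the numbers given here; according to the abstract it covers a possible $K_{S}^{0}$ and $K_{L}^{0}$ difference.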

Submitted 21 June, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

Journal ref: J. High Energ. Phys. 2025, 194 (2025)

arXiv:2502.20742 [pdf, other]

Structured Preference Optimization for Vision-Language Long-Horizon Task Planning

Authors: Xiwen Liang, Min Lin, Weiqi Ruan, Rongtao Xu, Yuecheng Liu, Jiaqi Chen, Bingqian Lin, Yuzheng Zhuang, Xiaodan Liang

Abstract: Existing methods for vision-language task planning excel in short-horizon tasks but often fall short in complex, long-horizon planning within dynamic environments. These challenges primarily arise from the difficulty of effectively training models to produce high-quality reasoning processes for long-horizon tasks. To address this, we propose Structured Preference Optimization (SPO), which aims to enhance reasoning and action selection in long-horizon task planning through structured preference evaluation and optimized training strategies. Specifically, SPO introduces: 1) Preference-Based Scoring and Optimization, which systematically evaluates reasoning chains based on task relevance, visual grounding, and historical consistency; and 2) Curriculum-Guided Training, where the model progressively adapts from simple to complex tasks, improving its generalization ability in long-horizon scenarios and enhancing reasoning robustness. To advance research in vision-language long-horizon task planning, we introduce ExtendaBench, a comprehensive benchmark covering 1,509 tasks across VirtualHome and Habitat 2.0, categorized into ultra-short, short, medium, and long tasks. Experimental results demonstrate that SPO significantly improves reasoning quality and final decision accuracy, outperforming prior methods on long-horizon tasks and underscoring the effectiveness of preference-driven optimization in vision-language task planning. Specifically, SPO achieves a +5.98% GCR and +4.68% SR improvement in VirtualHome and a +3.30% GCR and +2.11% SR improvement in Habitat over the best-performing baselines.

Submitted 15 May, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

Comments: 18 pages

Showing 51–100 of 557 results for author: Zhuang, Y