
Showing 1–50 of 335 results for author: Chenghao

Searching in archive cs.
  1. arXiv:2509.03867  [pdf, ps, other]

    cs.CL

    Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

    Authors: Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin

    Abstract: We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth", utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that c… ▽ More

    Submitted 3 September, 2025; originally announced September 2025.

    Comments: Accepted for oral presentation at the EMNLP 2025 Main Conference

  2. arXiv:2509.02972  [pdf, ps, other]

    cs.RO

    IL-SLAM: Intelligent Line-assisted SLAM Based on Feature Awareness for Dynamic Environments

    Authors: Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ruidong Yang, Yonghoon Ji, Nak Young Chong

    Abstract: Visual Simultaneous Localization and Mapping (SLAM) plays a crucial role in autonomous systems. Traditional SLAM methods, based on static environment assumptions, struggle to handle complex dynamic environments. Recent dynamic SLAM systems employ geometric constraints and deep learning to remove dynamic features, yet this creates a new challenge: insufficient remaining point features for subsequen… ▽ More

    Submitted 2 September, 2025; originally announced September 2025.

    Comments: submitted to International Conference on Robotic Computing and Communication (IEEE IRC)

  3. arXiv:2509.01111  [pdf, ps, other]

    cs.RO

    SR-SLAM: Scene-reliability Based RGB-D SLAM in Diverse Environments

    Authors: Haolan Zhang, Chenghao Li, Thanh Nguyen Canh, Lijun Wang, Nak Young Chong

    Abstract: Visual simultaneous localization and mapping (SLAM) plays a critical role in autonomous robotic systems, especially where accurate and reliable measurements are essential for navigation and sensing. In feature-based SLAM, the quantity and quality of extracted features significantly influence system performance. Due to the variations in feature quantity and quality across diverse environments, curre… ▽ More

    Submitted 1 September, 2025; originally announced September 2025.

    Comments: submitted

  4. arXiv:2508.19257  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.RO

    TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

    Authors: Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Shixin Wu, Songfang Huang, Huiling Duan

    Abstract: Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approac… ▽ More

    Submitted 15 August, 2025; originally announced August 2025.

    Comments: Manuscript submitted to AAAI 2026, currently under review

  5. STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning

    Authors: Chenghao Wu, Ruiyang Ren, Junjie Zhang, Ruirui Wang, Zhongrui Ma, Qi Ye, Wayne Xin Zhao

    Abstract: While modern recommender systems are instrumental in navigating information abundance, they remain fundamentally limited by static user modeling and reactive decision-making paradigms. Current large language model (LLM)-based agents inherit these shortcomings through their overreliance on heuristic pattern matching, yielding recommendations prone to shallow correlation bias, limited causal inferen… ▽ More

    Submitted 26 August, 2025; originally announced August 2025.

    Journal ref: Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM 2025)

  6. arXiv:2508.16654  [pdf, ps, other]

    cs.CV

    MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning

    Authors: Chenghao Liu, Zhimu Zhou, Jiachen Zhang, Minghao Zhang, Songfang Huang, Huiling Duan

    Abstract: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate complex environments. Current approaches often adopt a "black-box" paradigm, where a single Large Language Model (LLM) makes end-to-end decisions. However, it is plagued by critical vulnerabilities, including poor spatial reasoning, weak cross-modal grounding, and memory overload in long-… ▽ More

    Submitted 1 September, 2025; v1 submitted 20 August, 2025; originally announced August 2025.

    Comments: 19 pages, 15 figures

  7. arXiv:2508.15825  [pdf, ps, other]

    cs.CL q-fin.ST

    Enhancing Cryptocurrency Sentiment Analysis with Multimodal Features

    Authors: Chenghao Liu, Aniket Mahanti, Ranesh Naha, Guanghao Wang, Erwann Sbai

    Abstract: As cryptocurrencies gain popularity, the digital asset marketplace becomes increasingly significant. Understanding social media signals offers valuable insights into investor sentiment and market dynamics. Prior research has predominantly focused on text-based platforms such as Twitter. However, video content remains underexplored, despite potentially containing richer emotional and contextual sen… ▽ More

    Submitted 31 August, 2025; v1 submitted 17 August, 2025; originally announced August 2025.

  8. arXiv:2508.15392  [pdf, ps, other]

    cs.LG cs.CL

    CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials

    Authors: Chenghao Zhang, Qingqing Long, Ludi Wang, Wenjuan Cui, Jianjun Yu, Yi Du

    Abstract: Text-attributed graphs (TAGs) are pervasive in real-world systems, where each node carries its own textual features. In many cases these graphs are inherently heterogeneous, containing multiple node types and diverse edge types. Despite the ubiquity of such heterogeneous TAGs, there remains a lack of large-scale benchmark datasets. This shortage has become a critical bottleneck, hindering the develo… ▽ More

    Submitted 21 August, 2025; originally announced August 2025.

    Comments: 23 pages, 4 figures

  9. arXiv:2508.13692  [pdf, ps, other]

    cs.CV

    HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes

    Authors: Keliang Li, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen

    Abstract: The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity about human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respecti… ▽ More

    Submitted 19 August, 2025; originally announced August 2025.

  10. arXiv:2508.13618  [pdf, ps, other]

    cs.CV

    TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

    Authors: Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang

    Abstract: Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To… ▽ More

    Submitted 19 August, 2025; originally announced August 2025.

  11. arXiv:2508.13167  [pdf, ps, other]

    cs.AI cs.CL

    Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

    Authors: Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, et al. (5 additional authors not shown)

    Abstract: Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and… ▽ More

    Submitted 6 August, 2025; originally announced August 2025.

    Comments: 51 pages

  12. arXiv:2508.10667  [pdf, ps, other]

    cs.CV cs.AI

    AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models

    Authors: Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan, Shiming Xiang, Gaofeng Meng, Jieping Ye

    Abstract: Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images… ▽ More

    Submitted 14 August, 2025; originally announced August 2025.

  13. FineBadminton: A Multi-Level Dataset for Fine-Grained Badminton Video Understanding

    Authors: Xusheng He, Wei Liu, Shanshan Ma, Qian Liu, Chenghao Ma, Jianlong Wu

    Abstract: Fine-grained analysis of complex and high-speed sports like badminton presents a significant challenge for Multimodal Large Language Models (MLLMs), despite their notable advancements in general video understanding. This difficulty arises primarily from the scarcity of datasets with sufficiently rich and domain-specific annotations. To bridge this gap, we introduce FineBadminton, a novel and large… ▽ More

    Submitted 10 August, 2025; originally announced August 2025.

  14. arXiv:2508.04625  [pdf, ps, other]

    cs.CV cs.CE

    FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

    Authors: Zichen Tang, Haihong E, Jiacheng Liu, Zhongjun Yang, Rongjin Li, Zihua Rong, Haoyang He, Zhuodi Hao, Xinyang Hu, Kun Ji, Ziyan Ma, Mengyuan Ji, Jun Zhang, Chenghao Ma, Qianhe Zheng, Yang Liu, Yiling Huang, Xinyi Hu, Qing Huang, Zijian Xie, Shiyao Peng

    Abstract: We present FinMMR, a novel bilingual multimodal benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. Compared to existing benchmarks, our work introduces three significant advancements. (1) Multimodality: We meticulously transform existing financial reasoning benchmarks, and construct novel questions from the… ▽ More

    Submitted 6 August, 2025; originally announced August 2025.

    Comments: Accepted by ICCV 2025. arXiv admin note: text overlap with arXiv:2311.06602 by other authors

  15. arXiv:2508.04379  [pdf, ps, other]

    cs.CV cs.LG

    VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Visual Backbones

    Authors: Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu

    Abstract: Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models. However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) data-modality gap between stru… ▽ More

    Submitted 6 August, 2025; originally announced August 2025.

    Comments: 21 pages

  16. arXiv:2508.03003  [pdf, ps, other]

    cs.RO

    Thruster-Enhanced Locomotion: A Decoupled Model Predictive Control with Learned Contact Residuals

    Authors: Chenghao Wang, Alireza Ramezani

    Abstract: Husky Carbon, a robot developed by Northeastern University, serves as a research platform to explore unification of posture manipulation and thrust vectoring. Unlike conventional quadrupeds, its joint actuators and thrusters enable enhanced control authority, facilitating thruster-assisted narrow-path walking. While a unified Model Predictive Control (MPC) framework optimizing both ground reaction… ▽ More

    Submitted 4 August, 2025; originally announced August 2025.

  17. arXiv:2508.02890  [pdf, ps, other]

    cs.CV cs.CL

    VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction

    Authors: Rongxin Jiang, Robert Long, Chenghao Gu, Mingrui Yan

    Abstract: This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in maintaining high visual fidelity, genuine creativity, and precise adherence to nuanced user instructions when generating long-form texts. VisuCraft addresses th… ▽ More

    Submitted 4 August, 2025; originally announced August 2025.

  18. arXiv:2507.23203  [pdf, ps, other]

    cs.RO

    Quadratic Programming-Based Posture Manipulation and Thrust-vectoring for Agile Dynamic Walking on Narrow Pathways

    Authors: Chenghao Wang, Eric Sihite, Kaushik Venkatesh Krishnamurthy, Shreyansh Pitroda, Adarsh Salagame, Alireza Ramezani, Morteza Gharib

    Abstract: There has been significant advancement in legged robot's agility where they can show impressive acrobatic maneuvers, such as parkour. These maneuvers rely heavily on posture manipulation. To expand the stability and locomotion plasticity, we use the multi-modal ability in our legged-aerial platform, the Husky Beta, to perform thruster-assisted walking. This robot has thrusters on each of its sagit… ▽ More

    Submitted 30 July, 2025; originally announced July 2025.

  19. arXiv:2507.22607  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

    Authors: Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong

    Abstract: Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across vari… ▽ More

    Submitted 31 July, 2025; v1 submitted 30 July, 2025; originally announced July 2025.

    Comments: 21 pages, 5 figures, 6 tables. Work in progress

  20. arXiv:2507.21750  [pdf, ps, other]

    cs.CL

    Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

    Authors: Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin

    Abstract: Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both… ▽ More

    Submitted 29 July, 2025; originally announced July 2025.

    Comments: This paper was accepted with an A-decision to Transactions of the Association for Computational Linguistics. This version is the pre-publication version prior to MIT Press production

  21. arXiv:2507.21709  [pdf, ps, other]

    cs.RO

    Adaptive Prior Scene-Object SLAM for Dynamic Environments

    Authors: Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Nak Young Chong

    Abstract: Visual Simultaneous Localization and Mapping (SLAM) plays a vital role in real-time localization for autonomous systems. However, traditional SLAM methods, which assume a static environment, often suffer from significant localization drift in dynamic scenarios. While recent advancements have improved SLAM performance in such environments, these systems still struggle with localization drift, parti… ▽ More

    Submitted 29 July, 2025; originally announced July 2025.

    Comments: Accepted by IEEE The 2025 IEEE International Conference on Real-time Computing and Robotics

  22. arXiv:2507.13043  [pdf, ps, other]

    cs.LG

    The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

    Authors: Lefei Shen, Mouxiang Chen, Han Fu, Xiaoxue Ren, Xiaoyun Joy Wang, Jianling Sun, Zhuo Li, Chenghao Liu

    Abstract: Transformer-based models have recently become dominant in Long-term Time Series Forecasting (LTSF), yet the variations in their architecture, such as encoder-only, encoder-decoder, and decoder-only designs, raise a crucial question: What Transformer architecture works best for LTSF tasks? However, existing models are often tightly coupled with various time-series-specific designs, making it diffic… ▽ More

    Submitted 17 July, 2025; originally announced July 2025.

    Comments: 15 pages, 6 figures

  23. arXiv:2507.07663  [pdf]

    cs.CV

    MolCLIP: A Molecular-Auxiliary CLIP Framework for Identifying Drug Mechanism of Action Based on Time-Lapsed Mitochondrial Images

    Authors: Fengqian Pang, Chunyue Lei, Hongfei Zhao, Chenghao Liu, Zhiqiang Xing, Huafeng Wang, Chuyang Ye

    Abstract: Drug Mechanism of Action (MoA) mainly investigates how drug molecules interact with cells, which is crucial for drug discovery and clinical application. Recently, deep learning models have been used to recognize MoA by relying on high-content and fluorescence images of cells exposed to various drugs. However, these methods focus on spatial characteristics while overlooking the temporal dynamics of… ▽ More

    Submitted 10 July, 2025; originally announced July 2025.

  24. arXiv:2507.04631  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

    Authors: Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang, Ao Ma, Chenyou Fan, Tin Lun Lam, Junjie Hu

    Abstract: Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching… ▽ More

    Submitted 6 July, 2025; originally announced July 2025.

    Journal ref: ICCV 2025

  25. arXiv:2507.01887  [pdf, ps, other]

    cs.CL

    MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants

    Authors: Dongyi Ding, Tiannan Wang, Chenghao Zhu, Meiling Tao, Yuchen Eleanor Jiang, Wangchunshu Zhou

    Abstract: Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands are impractical for widespread deployment. Yet, small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the "SLMs Learnabi… ▽ More

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Work in progress

  26. arXiv:2506.22813  [pdf, ps, other]

    cs.CL

    Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models

    Authors: Zhuojun Ding, Wei Wei, Chenghao Fan

    Abstract: Supervised fine-tuning (SFT) is widely used to align large language models (LLMs) with information extraction (IE) tasks, such as named entity recognition (NER). However, annotating such fine-grained labels and training domain-specific models is costly. Existing works typically train a unified model across multiple domains, but such approaches lack adaptation and scalability since not all training… ▽ More

    Submitted 28 June, 2025; originally announced June 2025.

  27. arXiv:2506.17871  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    How Alignment Shrinks the Generative Horizon

    Authors: Chenghao Yang, Ari Holtzman

    Abstract: Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the Branching Factor (BF) -- a token-invariant measure of the effective numb… ▽ More

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: Codebase: https://github.com/yangalan123/LLMBranchingFactor, Website: https://yangalan123.github.io/branching_factor/

  28. arXiv:2506.14248  [pdf, ps, other]

    cs.CL cs.AI

    Re-Initialization Token Learning for Tool-Augmented Large Language Models

    Authors: Chenghao Li, Liu Liu, Baosheng Yu, Jiayan Qiu, Yibing Zhan

    Abstract: Large language models have demonstrated exceptional performance, yet struggle with complex tasks such as numerical reasoning, plan generation. Integrating external tools, such as calculators and databases, into large language models (LLMs) is crucial for enhancing problem-solving capabilities. Current methods assign a unique token to each tool, enabling LLMs to call tools through token prediction-… ▽ More

    Submitted 17 June, 2025; originally announced June 2025.

  29. arXiv:2506.14087  [pdf, ps, other]

    cs.LG

    Multi-Scale Finetuning for Encoder-based Time Series Foundation Models

    Authors: Zhongzheng Qiao, Chenghao Liu, Yiming Zhang, Ming Jin, Quang Pham, Qingsong Wen, P. N. Suganthan, Xudong Jiang, Savitha Ramasamy

    Abstract: Time series foundation models (TSFMs) demonstrate impressive zero-shot performance for time series forecasting. However, an important yet underexplored challenge is how to effectively finetune TSFMs on specific downstream tasks. While naive finetuning can yield performance gains, we argue that it falls short of fully leveraging TSFMs' capabilities, often resulting in overfitting and suboptimal per… ▽ More

    Submitted 16 June, 2025; originally announced June 2025.

  30. arXiv:2506.12915  [pdf, ps, other]

    cs.CL

    PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

    Authors: Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

    Abstract: With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, t… ▽ More

    Submitted 15 June, 2025; originally announced June 2025.

    Comments: Work in progress

  31. arXiv:2506.09513  [pdf, ps, other]

    cs.CL cs.AI cs.MA

    ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

    Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu

    Abstract: Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is co… ▽ More

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 24 pages, 6 figures, 7 tables

  32. arXiv:2506.07785  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger

    Authors: Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang

    Abstract: Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we prop… ▽ More

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: ICML 2025 Spotlight. 22 pages, 16 figures

  33. arXiv:2506.07276  [pdf, ps, other]

    cs.LG cs.AI

    Tokenized Bandit for LLM Decoding and Alignment

    Authors: Suho Shin, Chenghao Yang, Haifeng Xu, Mohammad T. Hajiaghayi

    Abstract: We introduce the tokenized linear bandit (TLB) and multi-armed bandit (TMAB), variants of linear and stochastic multi-armed bandit problems inspired by LLM decoding and alignment. In these problems, at each round $t \in [T]$, a user submits a query (context), and the decision maker (DM) sequentially selects a token irrevocably from a token set. Once the sequence is complete, the DM observes a rand… ▽ More

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: To appear at ICML 2025

  34. arXiv:2506.07044  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

    Authors: LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing… ▽ More

    Submitted 13 June, 2025; v1 submitted 8 June, 2025; originally announced June 2025.

    Comments: Technical Report, 53 pages, 25 tables, and 16 figures. Our webpage is https://alibaba-damo-academy.github.io/lingshu/

  35. arXiv:2506.00812  [pdf, other]

    cs.DB

    VecFlow: A High-Performance Vector Data Management System for Filtered-Search on GPUs

    Authors: Jingyi Xi, Chenghao Mo, Benjamin Karsin, Artem Chirkin, Mingqin Li, Minjia Zhang

    Abstract: Vector search and database systems have become a keystone component in many AI applications. While many prior research has investigated how to accelerate the performance of generic vector search, emerging AI applications require running more sophisticated vector queries efficiently, such as vector search with attribute filters. Unfortunately, recent filtered-ANNS solutions are primarily designed f… ▽ More

    Submitted 31 May, 2025; originally announced June 2025.

  36. arXiv:2505.23810  [pdf, ps, other]

    cs.CL cs.AI

    MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

    Authors: Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

    Abstract: Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer, sophisticated cross-turn dependency, is criticized all along. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present MARS-Benc… ▽ More

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 29 pages, 13 figures

  37. arXiv:2505.21959   

    cs.LG cs.CL

    EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles

    Authors: Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, Bang An, Bayan Bruss, John Langford, Furong Huang

    Abstract: With Large Language Models (LLMs) rapidly approaching and potentially surpassing human-level performance, it has become imperative to develop approaches capable of effectively supervising and enhancing these powerful models using smaller, human-level models exposed to only human-level data. We address this critical weak-to-strong (W2S) generalization challenge by proposing a novel method aimed at… ▽ More

    Submitted 4 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

    Comments: Manuscript uploaded as version 2 of arXiv:2410.04571

  38. arXiv:2505.21848  [pdf, ps, other]

    cs.CV

    FPAN: Mitigating Replication in Diffusion Models through the Fine-Grained Probabilistic Addition of Noise to Token Embeddings

    Authors: Jingqi Xu, Chenghao Li, Yuke Zhang, Peter A. Beerel

    Abstract: Diffusion models have demonstrated remarkable potential in generating high-quality images. However, their tendency to replicate training data raises serious privacy concerns, particularly when the training datasets contain sensitive or private information. Existing mitigation strategies primarily focus on reducing image duplication, modifying the cross-attention mechanism, and altering the denoisi… ▽ More

    Submitted 27 May, 2025; originally announced May 2025.

  39. arXiv:2505.20471  [pdf, ps, other]

    cs.CV cs.AI cs.ET cs.LG cs.RO

    WeatherEdit: Controllable Weather Editing with 4D Gaussian Field

    Authors: Chenghao Qian, Wenjing Li, Yuhu Guo, Gustav Markkula

    Abstract: In this work, we present WeatherEdit, a novel weather editing pipeline for generating realistic weather effects with controllable types and severity in 3D scenes. Our approach is structured into two key components: weather background editing and weather particle construction. For weather background editing, we introduce an all-in-one adapter that integrates multiple weather styles into a single pr… ▽ More

    Submitted 7 August, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  40. Smart Energy Guardian: A Hybrid Deep Learning Model for Detecting Fraudulent PV Generation

    Authors: Xiaolu Chen, Chenghao Huang, Yanru Zhang, Hao Wang

    Abstract: With the proliferation of smart grids, smart cities face growing challenges due to cyber-attacks and sophisticated electricity theft behaviors, particularly in residential photovoltaic (PV) generation systems. Traditional Electricity Theft Detection (ETD) methods often struggle to capture complex temporal dependencies and integrating multi-source data, limiting their effectiveness. In this work, w… ▽ More

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: 2024 IEEE International Smart Cities Conference (ISC2)

  41. arXiv:2505.18750  [pdf, ps, other]

    eess.SY cs.AI math.OC

    Agent-Based Decentralized Energy Management of EV Charging Station with Solar Photovoltaics via Multi-Agent Reinforcement Learning

    Authors: Jiarong Fan, Chenghao Huang, Hao Wang

    Abstract: In the pursuit of energy net zero within smart cities, transportation electrification plays a pivotal role. The adoption of Electric Vehicles (EVs) keeps increasing, making energy management of EV charging stations critically important. While previous studies have managed to reduce energy cost of EV charging while maintaining grid stability, they often overlook the robustness of EV charging manage…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: 2024 IEEE International Smart Cities Conference (ISC2)

  42. Season-Independent PV Disaggregation Using Multi-Scale Net Load Temporal Feature Extraction and Weather Factor Fusion

    Authors: Xiaolu Chen, Chenghao Huang, Yanru Zhang, Hao Wang

    Abstract: With the advancement of energy Internet and energy system integration, the increasing adoption of distributed photovoltaic (PV) systems presents new challenges for smart monitoring and measurement for utility companies, particularly in separating PV generation from net electricity load. Existing methods struggle with feature extraction from net load and capturing the relevance between weather facto…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: 2024 IEEE 8th Conference on Energy Internet and Energy System Integration (EI2)

  43. arXiv:2505.12022  [pdf, other]

    cs.DS

    A Reduction-based Algorithm for the Clique Interdiction Problem

    Authors: Chenghao Zhu, Yi Zhou, Haoyu Jiang

    Abstract: The Clique Interdiction Problem (CIP) aims to minimize the size of the largest clique in a given graph by removing a given number of vertices. The CIP models a special Stackelberg game and has important applications in fields such as pandemic control and terrorist identification. However, the CIP is a bilevel graph optimization problem, making it very challenging to solve. Recently, data reduction…

    Submitted 20 May, 2025; v1 submitted 17 May, 2025; originally announced May 2025.

  44. arXiv:2505.11890  [pdf, other]

    cs.CE

    LLM-Enhanced Feature Engineering for Multi-Factor Electricity Price Predictions

    Authors: Haochen Xue, Chenghao Liu, Chong Zhang, Yuxuan Chen, Angxiao Zong, Zhaodong Wu, Yulong Li, Jiayi Liu, Kaiyu Liang, Zhixiang Lu, Ruobing Li, Jionglong Su

    Abstract: Accurately forecasting electricity price volatility is crucial for effective risk management and decision-making. Traditional forecasting models often fall short in capturing the complex, non-linear dynamics of electricity markets, particularly when external factors like weather conditions and market volatility are involved. These limitations hinder their ability to provide reliable predictions in…

    Submitted 17 May, 2025; originally announced May 2025.

  45. arXiv:2505.10983  [pdf, ps, other]

    cs.LG cs.AI cs.CR

    GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models

    Authors: Haozheng Luo, Chenghao Qiu, Yimin Wang, Shang Wu, Jiahao Yu, Han Liu, Binghui Wang, Yan Chen

    Abstract: We propose the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), named GenoArmory. Unlike existing GFM benchmarks, GenoArmory offers the first comprehensive evaluation framework to systematically assess the vulnerability of GFMs to adversarial attacks. Methodologically, we evaluate the adversarial robustness of five state-of-the-art GFMs using four widely adopted att…

    Submitted 16 May, 2025; originally announced May 2025.

  46. arXiv:2505.09012  [pdf, ps, other]

    cs.AI eess.SY

    Deep Reinforcement Learning for Power Grid Multi-Stage Cascading Failure Mitigation

    Authors: Bo Meng, Chenghao Xu, Yongli Zhu

    Abstract: Cascading failures in power grids can lead to grid collapse, causing severe disruptions to social operations and economic activities. In certain cases, multi-stage cascading failures can occur. However, existing cascading-failure-mitigation strategies are usually single-stage-based, overlooking the complexity of the multi-stage scenario. This paper treats the multi-stage cascading failure problem…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: This paper has been accepted and presented at ICLR 2025 in Singapore, Apr. 28, 2025

  47. arXiv:2505.00598  [pdf, ps, other]

    cs.LG cs.AI

    Fast and Low-Cost Genomic Foundation Models via Outlier Removal

    Authors: Haozheng Luo, Chenghao Qiu, Maojiang Su, Zhihan Zhou, Zoe Mehta, Guo Ye, Jerry Yao-Chieh Hu, Han Liu

    Abstract: To address the challenge of scarce computational resources in genomic modeling, we introduce GERM, a genomic foundation model with strong compression performance and fast adaptability. GERM improves upon models like DNABERT-2 by eliminating outliers that hinder low-rank adaptation and post-training quantization, enhancing both efficiency and robustness. We replace the vanilla attention layer with…

    Submitted 2 May, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

    Comments: International Conference on Machine Learning (ICML) 2025

  48. arXiv:2504.21117  [pdf, other]

    cs.CL

    Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

    Authors: Hanhua Hong, Chenghao Xiao, Yang Wang, Yiqi Liu, Wenge Rong, Chenghua Lin

    Abstract: Evaluating natural language generation (NLG) systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluation offers a scalable alternative but is highly sensitive to prompt design, where small variations can lead to significant…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: 10 pages

  49. Privacy-Preserving Personalized Federated Learning for Distributed Photovoltaic Disaggregation under Statistical Heterogeneity

    Authors: Xiaolu Chen, Chenghao Huang, Yanru Zhang, Hao Wang

    Abstract: The rapid expansion of distributed photovoltaic (PV) installations worldwide, many being behind-the-meter systems, has significantly challenged energy management and grid operations, as unobservable PV generation further complicates the supply-demand balance. Therefore, estimating this generation from net load, known as PV disaggregation, is critical. Given privacy concerns and the need for large…

    Submitted 22 May, 2025; v1 submitted 25 April, 2025; originally announced April 2025.

    Comments: 11 pages

    Journal ref: IEEE Transactions on Instrumentation and Measurement, 2025

  50. arXiv:2504.15585  [pdf, ps, other]

    cs.CR cs.AI cs.CL cs.LG

    A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

    Authors: Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Shicheng Xu, Junyuan Mao, Yu Wang, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Wenjie Qu, et al. (78 additional authors not shown)

    Abstract: The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concer…

    Submitted 8 June, 2025; v1 submitted 22 April, 2025; originally announced April 2025.