Showing 1–50 of 1,179 results for author: Gao, X

Searching in archive cs.
  1. arXiv:2507.01449  [pdf, ps, other]

    cs.CL

    LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

    Authors: Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun

    Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many efforts to improve SD aim to eliminate the need for a draft model and to generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhea…

    Submitted 2 July, 2025; originally announced July 2025.

  2. arXiv:2507.01401  [pdf, ps, other]

    cs.CV cs.AI

    Medical-Knowledge Driven Multiple Instance Learning for Classifying Severe Abdominal Anomalies on Prenatal Ultrasound

    Authors: Huanwen Liang, Jingxian Xu, Yuanji Zhang, Yuhao Huang, Yuhan Zhang, Xin Yang, Ran Li, Xuedong Deng, Yanjun Liu, Guowei Tao, Yun Wu, Sheng Zhao, Xinru Gao, Dong Ni

    Abstract: Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality. Although AI has demonstrated significant potential in medical diagnosis, its application to prenatal abdominal anomalies remains limited. Most existing studies focus on image-level classification and rely on standard plane localization, placing less emp…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by MICCAI 2025

  3. arXiv:2507.01059  [pdf, ps, other]

    cs.MA cs.AI cs.CL cs.CV cs.RO

    Automated Vehicles Should be Connected with Natural Language

    Authors: Xiangbo Gao, Keshu Wu, Hao Zhang, Kexin Tian, Yang Zhou, Zhengzhong Tu

    Abstract: Multi-agent collaborative driving promises improvements in traffic safety and efficiency through collective perception and decision making. However, existing communication media -- including raw sensor data, neural network features, and perception results -- suffer from limitations in bandwidth efficiency, information completeness, and agent interoperability. Moreover, traditional approaches have large…

    Submitted 29 June, 2025; originally announced July 2025.

  4. arXiv:2507.00574  [pdf, ps, other]

    cs.LG

    Foundation Models for Clinical Records at Health System Scale

    Authors: Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Kyunghyun Cho, Cem M. Deniz, Narges Razavian

    Abstract: Large-scale pretraining has transformed modeling of language and other data types, but its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present a novel generative pretraining strategy for sequential EHR data using next-visit event prediction. Our model learns to autoregressively generate various tokenized clinical events for the next visit base…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to ICML 2025 Workshop on Foundation Models for Structured Data

  5. arXiv:2507.00462  [pdf, ps, other]

    cs.CV

    Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation

    Authors: Jizhou Han, Chenhao Ding, SongLin Dong, Yuhang He, Xinyuan Gao, Yihong Gong

    Abstract: Visual-language models (VLMs) like CLIP exhibit strong generalization but struggle with distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate strictly within CLIP's original feature space, relying on high-confidence samples while overlooking the potential of low-confidence ones. We propose MS-TTA, a training-free approach that enhances feature represen…

    Submitted 1 July, 2025; originally announced July 2025.

  6. arXiv:2506.23257  [pdf, ps, other]

    cs.CV

    PCLVis: Visual Analytics of Process Communication Latency in Large-Scale Simulation

    Authors: Chongke Bi, Xin Gao, Baofeng Fu, Yuheng Zhao, Siming Chen, Ying Zhao, Yunhai Wang

    Abstract: Large-scale simulations on supercomputers have become important tools for users. However, their scalability remains a problem due to the huge communication cost among parallel processes. Most existing communication latency analysis methods rely on physical link layer information, which is only available to administrators. In this paper, a framework called PCLVis is proposed to help gene…

    Submitted 29 June, 2025; originally announced June 2025.

  7. arXiv:2506.22438  [pdf]

    cs.CV

    Counting with Confidence: Accurate Pest Monitoring in Water Traps

    Authors: Xumin Gao, Mark Stevens, Grzegorz Cielniak

    Abstract: Accurate pest population monitoring and tracking their dynamic changes are crucial for precision agriculture decision-making. A common limitation in existing vision-based automatic pest counting research is that models are typically evaluated on datasets with ground truth but deployed in real-world scenarios without assessing the reliability of counting results due to the lack of ground truth. To…

    Submitted 19 May, 2025; originally announced June 2025.

    Comments: © 20XX the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND

  8. arXiv:2506.21263  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

    Authors: Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu, Wu He, Xun Gao, Jason Zeng, Michael Heinrich

    Abstract: The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper,…

    Submitted 26 June, 2025; originally announced June 2025.

  9. arXiv:2506.20947  [pdf, ps, other]

    cs.CV cs.MM

    Hierarchical Sub-action Tree for Continuous Sign Language Recognition

    Authors: Dejie Yang, Zhu Xu, Xinjie Gao, Yang Liu

    Abstract: Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typi…

    Submitted 25 June, 2025; originally announced June 2025.

  10. arXiv:2506.19283  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration

    Authors: Xiangbo Gao, Yuheng Wu, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu

    Abstract: While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative o…

    Submitted 30 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  11. arXiv:2506.18544  [pdf, ps, other]

    cs.CV

    Normality Prior Guided Multi-Semantic Fusion Network for Unsupervised Image Anomaly Detection

    Authors: Muhao Xu, Xueying Zhou, Xizhan Gao, Weiye Song, Guang Feng, Sijie Niu

    Abstract: Recently, detecting logical anomalies has become a more challenging task than detecting structural ones. Existing encoder-decoder-based methods typically compress inputs into low-dimensional bottlenecks on the assumption that the compression process can effectively suppress the transmission of logical anomalies to the decoder. However, logical anomalies present a particular difficulty beca…

    Submitted 23 June, 2025; originally announced June 2025.

  12. arXiv:2506.18315  [pdf, ps, other]

    cs.SE cs.AI

    Use Property-Based Testing to Bridge LLM Code Generation and Validation

    Authors: Lehan He, Zeren Chen, Zhe Zhang, Jing Shao, Xiang Gao, Lu Sheng

    Abstract: Large Language Models (LLMs) excel at code generation, but ensuring that their outputs are functionally correct, especially in complex programming tasks, is a persistent challenge. While traditional Test-Driven Development (TDD) offers a path for code refinement, its efficacy with LLMs is often undermined by the scarcity of high-quality test cases or the pitfalls of automated test generation, includi…

    Submitted 23 June, 2025; originally announced June 2025.

  13. arXiv:2506.18158  [pdf, ps, other]

    cs.AI cs.CV

    Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation

    Authors: Xinzge Gao, Chuanrui Hu, Bin Chen, Teng Li

    Abstract: Multimodal large language models (MLLMs) are attracting growing attention in the development of Graphical User Interface (GUI) agents. Existing approaches often rely on historical screenshots or actions to implicitly represent the task state. This reliance poses challenges for GUI agents in accurately understanding task states and underscores the absence of effective mechanisms to store critical i…

    Submitted 22 June, 2025; originally announced June 2025.

  14. arXiv:2506.17590  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving

    Authors: Mihir Godbole, Xiangbo Gao, Zhengzhong Tu

    Abstract: Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates mul…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 19 pages, 5 figures, Preprint under review. Code available at: https://github.com/taco-group/DRAMA-X

  15. arXiv:2506.16730  [pdf, ps, other]

    cs.CV

    TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion

    Authors: Mingrui Zhu, Xiru Chen, Xin Wei, Nannan Wang, Xinbo Gao

    Abstract: Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities, producing more informative and comprehensive outputs. Recently, text-guided IVF has shown great potential due to its flexibility and versatility. However, the effective integration and utilization of textual semantic information remains insufficiently studied. To tackle these challenges, w…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: 11 pages, 6 figures

  16. arXiv:2506.16712  [pdf, ps, other]

    cs.CL cs.AI

    ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models

    Authors: Bin Chen, Xinzge Gao, Chuanrui Hu, Penghang Yu, Hua Zhang, Bing-Kun Bao

    Abstract: Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative rewar…

    Submitted 19 June, 2025; originally announced June 2025.

  17. arXiv:2506.15741  [pdf, ps, other]

    cs.AI cs.CL

    OAgents: An Empirical Study of Building Effective Agents

    Authors: He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, Wangchunshu Zhou

    Abstract: Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we…

    Submitted 23 June, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

    Comments: 28 pages

  18. arXiv:2506.14466  [pdf, ps, other]

    cs.CR

    MalGuard: Towards Real-Time, Accurate, and Actionable Detection of Malicious Packages in PyPI Ecosystem

    Authors: Xingan Gao, Xiaobing Sun, Sicong Cao, Kaifeng Huang, Di Wu, Xiaolei Liu, Xingwei Lin, Yang Xiang

    Abstract: Malicious package detection has become a critical task in ensuring the security and stability of PyPI. Existing detection approaches have focused on advancing model selection, evolving from traditional machine learning (ML) models to large language models (LLMs). However, as model complexity increases, so does time consumption, which raises the question of whether a light…

    Submitted 17 June, 2025; originally announced June 2025.

  19. arXiv:2506.13541  [pdf, ps, other]

    cs.CL cs.LG

    Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization

    Authors: Guanghui Song, Dongping Liao, Yiren Zhao, Kejiang Ye, Cheng-zhong Xu, Xitong Gao

    Abstract: Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding "low-priority" tokens or statically groupin…

    Submitted 16 June, 2025; originally announced June 2025.

  20. arXiv:2506.13205  [pdf, ps, other]

    cs.CR cs.AI

    Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments

    Authors: Xuan Wang, Siyuan Liang, Zhe Liu, Yi Yu, Yuliang Lu, Xiaochun Cao, Ee-Chien Chang, Xitong Gao

    Abstract: With the growing integration of vision-language models (VLMs), mobile agents are now widely used for tasks like UI automation and camera-based user assistance. These agents are often fine-tuned on limited user-generated datasets, leaving them vulnerable to covert threats during the training process. In this work, we present GHOST, the first clean-label backdoor attack specifically designed for mobi…

    Submitted 2 July, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

    Comments: 12 pages

  21. arXiv:2506.12351  [pdf, ps, other]

    cs.CV

    EKPC: Elastic Knowledge Preservation and Compensation for Class-Incremental Learning

    Authors: Huaijie Wang, De Cheng, Lingfeng He, Yan Li, Jie Li, Nannan Wang, Xinbo Gao

    Abstract: Class-Incremental Learning (CIL) aims to enable AI models to continuously learn from sequentially arriving data of different classes over time while retaining previously acquired knowledge. Recently, Parameter-Efficient Fine-Tuning (PEFT) methods, like prompt pool-based approaches and adapter tuning, have attracted considerable attention in CIL. However, these methods either introduce additional parameters…

    Submitted 14 June, 2025; originally announced June 2025.

  22. arXiv:2506.12266  [pdf, ps, other]

    cs.CL cs.AI cs.HC cs.LG

    The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

    Authors: Avinash Baidya, Kamalika Das, Xiang Gao

    Abstract: Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving it remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: ACL 2025; 18 pages, 8 figures

  23. arXiv:2506.12103  [pdf, other]

    cs.AI cs.CY cs.LG

    The Amazon Nova Family of Models: Technical Report and Model Card

    Authors: Amazon AGI, Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, Abner Pena, Achin Jain, Adam Kunysz, Adam Opyrchal, Adarsh Singh, Aditya Rawal, Adok Achar Budihal Prasad, Adrià de Gispert, Agnika Kumar, Aishwarya Aryamane, Ajay Nair, Akilan M, Akshaya Iyengar, Akshaya Vishnu Kudlu Shanbhogue, et al. (761 additional authors not shown)

    Abstract: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents…

    Submitted 17 March, 2025; originally announced June 2025.

    Comments: 48 pages, 10 figures

    Report number: 20250317

  24. arXiv:2506.12087  [pdf, ps, other]

    cs.NE cs.AI

    Efficient Parallel Training Methods for Spiking Neural Networks with Constant Time Complexity

    Authors: Wanjin Feng, Xingyu Gao, Wenqian Du, Hailong Shi, Peilin Zhao, Pengcheng Wu, Chunyan Miao

    Abstract: Spiking Neural Networks (SNNs) often suffer from high time complexity $O(T)$ due to the sequential processing of $T$ spikes, making training computationally expensive. In this paper, we propose a novel Fixed-point Parallel Training (FPT) method to accelerate SNN training without modifying the network architecture or introducing additional assumptions. FPT reduces the time complexity to $O(K)$,…

    Submitted 10 June, 2025; originally announced June 2025.

  25. arXiv:2506.11862  [pdf, ps, other]

    cs.SD eess.AS eess.SP

    Confidence-Based Self-Training for EMG-to-Speech: Leveraging Synthetic EMG for Robust Modeling

    Authors: Xiaodan Chen, Xiaoxue Gao, Mathias Quoy, Alexandre Pitti, Nancy F. Chen

    Abstract: Voiced Electromyography (EMG)-to-Speech (V-ETS) models reconstruct speech from muscle activity signals, facilitating applications such as neurolaryngologic diagnostics. Despite its potential, the advancement of V-ETS is hindered by a scarcity of paired EMG-speech data. To address this, we propose a novel Confidence-based Multi-Speaker Self-training (CoM2S) approach, along with a newly curated Libr…

    Submitted 13 June, 2025; originally announced June 2025.

  26. arXiv:2506.11549  [pdf, ps, other]

    cs.CV eess.IV

    EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment

    Authors: Zhaoyang Wang, Wen Lu, Jie Li, Lihuo He, Maoguo Gong, Xinbo Gao

    Abstract: Free-energy-guided self-repair mechanisms have shown promising results in image quality assessment (IQA), but remain under-explored in video quality assessment (VQA), where temporal dynamics and model constraints pose unique challenges. Unlike static images, video content exhibits richer spatiotemporal complexity, making perceptual restoration more difficult. Moreover, VQA systems often rely on pr…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: This work has been submitted to the IEEE TCSVT for possible publication

  27. arXiv:2506.11545  [pdf, ps, other]

    eess.IV cs.CV

    FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution

    Authors: Zhaoyang Wang, Jie Li, Wen Lu, Lihuo He, Maoguo Gong, Xinbo Gao

    Abstract: State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. As video frame rates continue to increase, the diminishing inter-frame differences further expose the limitations of traditional frame-to-frame information exploitation methods, which are inadequat…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: This work has been submitted to the IEEE TMM for possible publication

  28. arXiv:2506.09393  [pdf, ps, other]

    cs.CL

    A Hierarchical Probabilistic Framework for Incremental Knowledge Tracing in Classroom Settings

    Authors: Xinyi Gao, Qiucheng Wu, Yang Zhang, Xuechen Liu, Kaizhi Qian, Ying Xu, Shiyu Chang

    Abstract: Knowledge tracing (KT) aims to estimate a student's evolving knowledge state and predict their performance on new exercises based on performance history. Many realistic classroom settings for KT are typically low-resource in data and require online updates as students' exercise history grows, which creates significant challenges for existing KT approaches. To restore strong performance under low-r…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 24 pages, 4 figures

  29. arXiv:2506.09234  [pdf, ps, other]

    cs.CE

    Transaction Categorization with Relational Deep Learning in QuickBooks

    Authors: Kaiwen Dong, Padmaja Jonnalagedda, Xiang Gao, Ayan Acharya, Maria Kissa, Mauricio Flores, Nitesh V. Chawla, Kamalika Das

    Abstract: Automatic transaction categorization is crucial for enhancing the customer experience in QuickBooks by providing accurate accounting and bookkeeping. The distinct challenges in this domain stem from the unique formatting of transaction descriptions, the wide variety of transaction categories, and the vast scale of the data involved. Furthermore, organizing transaction data in a relational database…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: Accepted to ECML-PKDD 2025

  30. arXiv:2506.07419  [pdf, ps, other]

    cs.SE

    Generate Realistic Test Scenes for V2X Communication Systems

    Authors: An Guo, Xinyu Gao, Chunrong Fang, Haoxiang Tian, Weisong Sun, Yanzhou Mu, Shuncheng Tang, Lei Ma, Zhenyu Chen

    Abstract: Accurately perceiving complex driving environments is essential for ensuring the safe operation of autonomous vehicles. With the tremendous progress in deep learning and communication technologies, cooperative perception with Vehicle-to-Everything (V2X) technologies has emerged as a solution to overcome the limitations of single-agent perception systems in perceiving distant objects and occlusions…

    Submitted 9 June, 2025; originally announced June 2025.

  31. arXiv:2506.06283  [pdf, other]

    cs.CV cs.AI

    Facial Foundational Model Advances Early Warning of Coronary Artery Disease from Live Videos with DigitalShadow

    Authors: Juexiao Zhou, Zhongyi Han, Mankun Xin, Xingwei He, Guotao Wang, Jiaoyan Song, Gongning Luo, Wenjia He, Xintong Li, Yuetan Chu, Juanwen Chen, Bo Wang, Xia Wu, Wenwen Duan, Zhixia Guo, Liyan Bai, Yilin Pan, Xuefei Bi, Lu Liu, Long Feng, Xiaonan He, Xin Gao

    Abstract: Global population aging presents increasing challenges to healthcare systems, with coronary artery disease (CAD) responsible for approximately 17.8 million deaths annually, making it a leading cause of global mortality. As CAD is largely preventable, early detection and proactive management are essential. In this work, we introduce DigitalShadow, an advanced early warning system for CAD, powered b…

    Submitted 23 April, 2025; originally announced June 2025.

  32. arXiv:2506.05779  [pdf, ps, other]

    cs.NI cs.LG

    Pegasus: A Universal Framework for Scalable Deep Learning Inference on the Dataplane

    Authors: Yinchao Zhang, Su Yao, Yong Feng, Kang Chen, Tong Li, Zhuotao Liu, Yi Zhao, Lexuan Zhang, Xiangyu Gao, Feng Xiong, Qi Li, Ke Xu

    Abstract: The paradigm of Intelligent DataPlane (IDP) embeds deep learning (DL) models on the network dataplane to enable intelligent traffic analysis at line-speed. However, the current use of the match-action table (MAT) abstraction on the dataplane is misaligned with DL inference, leading to several key limitations, including accuracy degradation, limited scale, and lack of generality. This paper propose…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: To be published in SIGCOMM 2025

  33. arXiv:2506.05692  [pdf, ps, other]

    cs.CR cs.AI

    SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code

    Authors: Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu

    Abstract: The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code. In this work, we introduce SafeGenBench, a benchmark specifically designed to assess the security of LLM-generated code. The dataset encompasses a wide range of…

    Submitted 20 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

  34. arXiv:2506.02917  [pdf, other]

    cs.RO

    Text-guided Generation of Efficient Personalized Inspection Plans

    Authors: Xingpeng Sun, Zherong Pan, Xifeng Gao, Kui Wu, Aniket Bera

    Abstract: We propose a training-free, Vision-Language Model (VLM)-guided approach for efficiently generating trajectories to facilitate target inspection planning based on text descriptions. Unlike existing Vision-and-Language Navigation (VLN) methods designed for general agents in unknown environments, our approach specifically targets the efficient inspection of known scenes, with widespread applications…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 8 pages, 5 figures

  35. arXiv:2506.02742  [pdf, ps, other]

    eess.AS cs.AI cs.SD eess.SP

    Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions

    Authors: Xiaoxue Gao, Huayun Zhang, Nancy F. Chen

    Abstract: Existing expressive text-to-speech (TTS) systems primarily model a limited set of categorical emotions, whereas human conversations extend far beyond these predefined emotions, making it essential to explore more diverse emotional speech generation for more natural interactions. To bridge this gap, this paper proposes a novel prompt-unseen-emotion (PUE) approach to generate unseen emotional speech…

    Submitted 3 June, 2025; originally announced June 2025.

  36. arXiv:2506.02605  [pdf, ps, other]

    cs.CV

    One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation

    Authors: Xue Wu, Jingwei Xin, Zhijun Tu, Jie Hu, Jie Li, Nannan Wang, Xinbo Gao

    Abstract: Diffusion-based models have been widely used in various visual generation tasks, showing promising results in image super-resolution (SR), while typically being limited by dozens or even hundreds of sampling steps. Although existing methods aim to accelerate the inference speed of multi-step diffusion-based SR methods through knowledge distillation, their generated images exhibit insufficient sema…

    Submitted 3 June, 2025; originally announced June 2025.

  37. arXiv:2506.02580  [pdf, other]

    cs.AI

    V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving

    Authors: Xuewen Luo, Fengze Yang, Fan Ding, Xiangbo Gao, Shuo Xing, Yang Zhou, Zhengzhong Tu, Chenxi Liu

    Abstract: Knowledge-driven autonomous driving systems (ADs) offer powerful reasoning capabilities, but face two critical challenges: limited perception due to the short-sightedness of single-vehicle sensors, and hallucination arising from the lack of real-time environmental grounding. To address these issues, this paper introduces V2X-UniPool, a unified framework that integrates multimodal Vehicle-to-Everyth…

    Submitted 3 June, 2025; originally announced June 2025.

  38. arXiv:2506.02439  [pdf, ps, other]

    cs.CV

    Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

    Authors: Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Xinbo Gao

    Abstract: Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model t…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Accepted by IEEE TIFS

  39. arXiv:2506.02412  [pdf, ps, other]

    cs.CL cs.AI

    SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning

    Authors: Zhengyuan Liu, Geyu Lin, Hui Li Tan, Huayun Zhang, Yanfeng Lu, Xiaoxue Gao, Stella Xin Yin, He Sun, Hock Huan Goh, Lung Hsiang Wong, Nancy F. Chen

    Abstract: The integration of generative artificial intelligence into educational applications has enhanced personalized and interactive learning experiences, and it shows strong potential to promote young learners' language acquisition. However, it is still challenging to ensure consistent and robust performance across different languages and cultural contexts, and kid-friendly design requires simplified in…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: ACL 2025 Industry Track

  40. arXiv:2506.01822  [pdf, ps, other]

    cs.CV cs.MM

    GSCodec Studio: A Modular Framework for Gaussian Splat Compression

    Authors: Sicheng Li, Chengzhen Wu, Hao Li, Xiang Gao, Yiyi Liao, Lu Yu

    Abstract: 3D Gaussian Splatting and its extension to 4D dynamic scenes enable photorealistic, real-time rendering from real-world captures, positioning Gaussian Splats (GS) as a promising format for next-generation immersive media. However, their high storage requirements pose significant challenges for practical use in sharing, transmission, and storage. Despite various studies exploring GS compression fro…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Repository of the project: https://github.com/JasonLSC/GSCodec_Studio

  41. arXiv:2506.00955  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

    Authors: Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

    Abstract: Sarcasm fundamentally alters meaning through tone and context, yet detecting it in speech remains a challenge due to data scarcity. In addition, existing detection systems often rely on multimodal data, limiting their applicability in contexts where only speech is available. To address this, we propose an annotation pipeline that leverages large language models (LLMs) to generate a sarcasm dataset…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  42. arXiv:2505.24500  [pdf, other]

    cs.CL cs.AI

    TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence

    Authors: Guiyang Hou, Xing Gao, Yuchuan Wu, Xiang Huang, Wenqi Zhang, Zhe Zheng, Yongliang Shen, Jialu Du, Fei Huang, Yongbin Li, Weiming Lu

    Abstract: Recently, Large Language Models (LLMs) have made significant progress in IQ-related domains that require careful thinking, such as mathematics and coding. However, enhancing LLMs' cognitive development in social domains, particularly from a post-training perspective, remains underexplored. Recognizing that the social world follows a distinct timeline and requires a richer blend of cognitive modes…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: 22 pages, 12 figures

  43. arXiv:2505.23744  [pdf, other]

    cs.CV cs.AI

    Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need

    Authors: Qiang Wang, Xiang Song, Yuhang He, Jizhou Han, Chenhao Ding, Xinyuan Gao, Yihong Gong

    Abstract: Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continual model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especiall…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted at CVPR 2025

  44. arXiv:2505.23123  [pdf, ps, other]

    cs.SI

    Offline Map Matching Based on Localization Error Distribution Modeling

    Authors: Ruilin Xu, Yuchen Song, Kaijie Li, Xitong Gao, Kejiang Ye, Fan Zhang, Juanjuan Zhao

    Abstract: Offline map matching involves aligning historical trajectories of mobile objects, which may have positional errors, with digital maps. This is essential for applications in intelligent transportation systems (ITS), such as route analysis and traffic pattern mining. Existing methods have two main limitations: (i) they assume a uniform Localization Error Distribution (LED) across urban areas, neglec…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: 13 pages

  45. arXiv:2505.22869  [pdf, ps, other]

    cs.CV cs.LG q-bio.BM

    CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

    Authors: Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao

    Abstract: Existing protein language models (PLMs) generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates the de novo protein design by integrating multimodal conditions with…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted at ICML 2025. Code is available at https://github.com/yinjunbo/cfpgen

  46. arXiv:2505.22018  [pdf, ps, other]

    cs.CL

    Improving Continual Pre-training Through Seamless Data Packing

    Authors: Ruicheng Yin, Xuan Gao, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang

    Abstract: Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinui…

    Submitted 29 May, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL 2025 Findings

  47. arXiv:2505.21962  [pdf, ps, other]

    cs.CV

    A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding

    Authors: Mengjingcheng Mo, Xinyang Tong, Jiaxu Leng, Mingpi Tan, Jiankang Zheng, Yiran Liu, Haosheng Chen, Ji Gan, Weisheng Li, Xinbo Gao

    Abstract: While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introdu…

    Submitted 28 May, 2025; originally announced May 2025.

  48. arXiv:2505.21743  [pdf, other]

    cs.LG cs.AI

    Simulating the Unseen: Crash Prediction Must Learn from What Did Not Happen

    Authors: Zihao Li, Xinyuan Cao, Xiangbo Gao, Kexin Tian, Keshu Wu, Mohammad Anis, Hao Zhang, Keke Long, Jiwan Jiang, Xiaopeng Li, Yunlong Zhang, Tianbao Yang, Dominique Lord, Zhengzhong Tu, Yang Zhou

    Abstract: Traffic safety science has long been hindered by a fundamental data paradox: the crashes we most wish to prevent are precisely those events we rarely observe. Existing crash-frequency models and surrogate safety metrics rely heavily on sparse, noisy, and under-reported records, while even sophisticated, high-fidelity simulations undersample the long-tailed situations that trigger catastrophic outc…

    Submitted 27 May, 2025; originally announced May 2025.

  49. arXiv:2505.21396  [pdf, ps, other]

    cs.CL cs.AI cs.CY cs.HC

    Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science

    Authors: Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang

    Abstract: Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing me…

    Submitted 27 May, 2025; originally announced May 2025.

  50. arXiv:2505.21074  [pdf, ps, other]

    cs.LG cs.AI cs.CR cs.CV stat.ML

    Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling

    Authors: Yichuan Cao, Yibo Miao, Xiao-Shan Gao, Yinpeng Dong

    Abstract: Text-to-image (T2I) models raise ethical and safety concerns due to their potential to generate inappropriate or harmful images. Evaluating these models' security through red-teaming is vital, yet white-box approaches are limited by their need for internal access, complicating their use with closed-source models. Moreover, existing black-box methods often assume knowledge about the model's specifi…

    Submitted 27 May, 2025; originally announced May 2025.