Showing 1–50 of 1,686 results for author: hu, X

Searching in archive cs.
  1. arXiv:2507.06116  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis

    Authors: Xintong Hu, Yixuan Chen, Rui Yang, Wenxiang Guo, Changhao Pan

    Abstract: Automatic speech quality assessment plays a crucial role in the development of speech synthesis systems, but existing models exhibit significant performance variations across different granularity levels of prediction tasks. This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models, incorporating a Mixture of Experts (MoE) classification head and utilizi…

    Submitted 8 July, 2025; originally announced July 2025.

  2. arXiv:2507.05798  [pdf, ps, other]

    cs.CV

    SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning

    Authors: Xin Hu, Ke Qin, Guiduo Duan, Ming Li, Yuan-Fang Li, Tao He

    Abstract: Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasonin…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: Accepted to ICCV 2025

  3. arXiv:2507.05288  [pdf, ps, other]

    cs.IR cs.AI cs.CL

    A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models

    Authors: Shuliang Liu, Hongyi Liu, Aiwei Liu, Bingchen Duan, Qi Zheng, Yibo Yan, He Geng, Peijie Jiang, Jia Liu, Xuming Hu

    Abstract: The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. Th…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Accepted by ACL 2025 Findings

  4. arXiv:2507.05173  [pdf, ps, other]

    cs.CV

    Semantic Frame Interpolation

    Authors: Yijia Hong, Jiangning Zhang, Ran Yi, Yuji Wang, Weijian Cao, Xiaobin Hu, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: Generating intermediate video content of varying lengths based on given first and last frames, along with text prompt information, offers significant research and application potential. However, traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers hav…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: https://github.com/hyj542682306/Semantic-Frame-Interpolation

  5. arXiv:2507.04909  [pdf, ps, other]

    cs.CV cs.AI

    HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding

    Authors: Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Xinwei He, Xiang Bai

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality a…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: Under review

  6. arXiv:2507.04781  [pdf, ps, other]

    cs.LG

    FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift

    Authors: Yong Zhang, Feng Liang, Guanghu Yuan, Min Yang, Chengming Li, Xiping Hu

    Abstract: Federated learning (FL) enables collaborative training of a global model in the centralized server with data from multiple parties while preserving privacy. However, data heterogeneity can significantly degrade the performance of the global model when each party uses datasets from different sources to train a local model, thereby affecting personalized local models. Among various cases of data het…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: 10 pages, 6 figures, and 1 table

  7. arXiv:2507.04705  [pdf, ps, other]

    cs.CV

    Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations

    Authors: Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant t…

    Submitted 7 July, 2025; originally announced July 2025.

  8. arXiv:2507.04404  [pdf, ps, other]

    cs.AI

    LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers

    Authors: Jingze Zhu, Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yanqiang Zheng, Jiawei Chen, Xu Yang, Bernt Schiele, Jonas Fischer, Xinting Hu

    Abstract: Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors, limiting their reliability in knowledge-intensive tasks. While decoding-time strategies provide a promising efficient solution without training, existing methods typically treat token-level and layer-level signals in isolation, overlooking the joint dynamics between them. In…

    Submitted 6 July, 2025; originally announced July 2025.

  9. arXiv:2507.04365  [pdf, ps, other]

    cs.CR cs.AI cs.CL

    Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

    Authors: Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho

    Abstract: As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. D…

    Submitted 6 July, 2025; originally announced July 2025.

  10. arXiv:2507.04218  [pdf, ps, other]

    cs.CV

    DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design

    Authors: Xiwei Hu, Haokun Chen, Zhongqi Qi, Hui Zhang, Dexiang Hong, Jie Shao, Xinglong Wu

    Abstract: We present DreamPoster, a Text-to-Image generation framework that intelligently synthesizes high-quality posters from user-provided images and text prompts while maintaining content fidelity and supporting flexible resolution and layout outputs. Specifically, DreamPoster is built upon our T2I model, Seedream3.0 to uniformly process different poster generating types. For dataset construction, we pr…

    Submitted 5 July, 2025; originally announced July 2025.

  11. arXiv:2507.03004  [pdf, ps, other]

    cs.CL cs.MA

    CLUES: Collaborative High-Quality Data Selection for LLMs via Training Dynamics

    Authors: Wanru Zhao, Hongxiang Fan, Shell Xu Hu, Wangchunshu Zhou, Bofan Chen, Nicholas D. Lane

    Abstract: Recent research has highlighted the importance of data quality in scaling large language models (LLMs). However, automated data quality control faces unique challenges in collaborative settings where sharing is not allowed directly between data silos. To tackle this issue, this paper proposes a novel data quality control technique based on the notion of data influence on the training dynamics of L…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: NeurIPS 2024

  12. arXiv:2507.02773  [pdf, ps, other]

    cs.AI cs.LG cs.MA

    KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs

    Authors: Yuzhang Xie, Hejie Cui, Ziyang Zhang, Jiaying Lu, Kai Shu, Fadi Nahab, Xiao Hu, Carl Yang

    Abstract: Medical diagnosis prediction plays a critical role in disease detection and personalized healthcare. While machine learning (ML) models have been widely adopted for this task, their reliance on supervised training limits their ability to generalize to unseen cases, particularly given the high cost of acquiring large, labeled datasets. Large language models (LLMs) have shown promise in leveraging l…

    Submitted 6 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

    Journal ref: American Medical Informatics Association (AMIA) 2025 Annual Symposium, Oral

  13. arXiv:2507.02307  [pdf, ps, other]

    cs.CV

    Flow-CDNet: A Novel Network for Detecting Both Slow and Fast Changes in Bitemporal Images

    Authors: Haoxuan Li, Chenxu Wei, Haodong Wang, Xiaomeng Hu, Boyuan An, Lingyan Ran, Baosen Zhang, Jin Jin, Omirzhan Taukebayev, Amirkhan Temirbayev, Junrui Liu, Xiuwei Zhang

    Abstract: Change detection typically involves identifying regions with changes between bitemporal images taken at the same location. Besides significant changes, slow changes in bitemporal images are also important in real-life scenarios. For instance, weak changes often serve as precursors to major hazards in scenarios like slopes, dams, and tailings ponds. Therefore, designing a change detection network t…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 18 pages, 8 figures

  14. arXiv:2507.01949  [pdf, ps, other]

    cs.CV

    Kwai Keye-VL Technical Report

    Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, et al. (35 additional authors not shown)

    Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video unde…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Technical Report: https://github.com/Kwai-Keye/Keye

  15. arXiv:2507.01908  [pdf, ps, other]

    cs.CV

    Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

    Authors: Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang

    Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and…

    Submitted 2 July, 2025; originally announced July 2025.

  16. arXiv:2507.01335  [pdf, ps, other]

    cs.CL cs.AI

    LEDOM: An Open and Fundamental Reverse Language Model

    Authors: Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan

    Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Work in progress

  17. arXiv:2507.00606  [pdf, ps, other]

    cs.CL cs.AI

    Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

    Authors: Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang

    Abstract: Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning with…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  18. arXiv:2507.00469  [pdf, ps, other]

    cs.CV cs.LG

    Bisecle: Binding and Separation in Continual Learning for Video Language Understanding

    Authors: Yue Tan, Xiaoqian Hu, Hao Xue, Celso De Melo, Flora D. Salim

    Abstract: Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 23 pages, 12 figures, 10 tables

  19. arXiv:2506.24113  [pdf, ps, other]

    cs.CV

    Epona: Autoregressive Diffusion World Model for Autonomous Driving

    Authors: Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin

    Abstract: Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: ICCV2025, Project Page: https://kevin-thu.github.io/Epona/

  20. arXiv:2506.23581  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    PBCAT: Patch-based composite adversarial training against physically realizable attacks on object detection

    Authors: Xiao Li, Yiming Zhu, Yifan Huang, Wei Zhang, Yingzhe He, Jie Shi, Xiaolin Hu

    Abstract: Object detection plays a crucial role in many security-sensitive applications. However, several recent studies have shown that object detectors can be easily fooled by physically realizable attacks, e.g., adversarial patches and recent adversarial textures, which pose realistic and urgent threats. Adversarial Training (AT) has been recognized as the most effective defense against adversarial attack…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025

  21. arXiv:2506.23482  [pdf, ps, other]

    cs.CV

    MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

    Authors: Jun Huang, Ting Liu, Yihang Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu

    Abstract: Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for obj…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: CVPR 2025

  22. arXiv:2506.23460  [pdf, ps, other]

    cs.CV

    Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation

    Authors: Dewen Zeng, Xinrong Hu, Yu-Jen Chen, Yawen Wu, Xiaowei Xu, Yiyu Shi

    Abstract: Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM-based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative fo…

    Submitted 29 June, 2025; originally announced June 2025.

  23. arXiv:2506.23038  [pdf, ps, other]

    cs.CV

    Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation

    Authors: Xinrong Hu, Yiyu Shi

    Abstract: Collecting pixel-level labels for medical datasets can be a laborious and expensive process, and enhancing segmentation performance with a scarcity of labeled data is a crucial challenge. This work introduces AugPaint, a data augmentation framework that utilizes inpainting to generate image-label pairs from limited labeled data. AugPaint leverages latent diffusion models, known for their ability t…

    Submitted 28 June, 2025; originally announced June 2025.

  24. arXiv:2506.22978  [pdf, ps, other]

    cs.CL cs.AI

    A Systematic Study of Compositional Syntactic Transformer Language Models

    Authors: Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu

    Abstract: Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional…

    Submitted 28 June, 2025; originally announced June 2025.

  25. arXiv:2506.21101  [pdf, ps, other]

    cs.CV

    OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

    Authors: Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, AndyPian Wu, Chaoyang Wang, Chengjie Wang, Taisong Jin, SevenShu, Yunsheng Wu, Yongge Liu, Rongrong Ji

    Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address t…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted to ICCV 2025

  26. arXiv:2506.20923  [pdf, ps, other]

    cs.CL

    KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

    Authors: Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

    Abstract: In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with si…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Technical Report; 26 pages 12 tables 1 figure. arXiv admin note: substantial text overlap with arXiv:2501.01028

  27. arXiv:2506.19288  [pdf, ps, other]

    cs.CV cs.RO

    Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding

    Authors: Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue, Hui Xiong

    Abstract: Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to…

    Submitted 30 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: 14 pages, 13 figures

  28. arXiv:2506.19171  [pdf, ps, other]

    cs.LG

    Distilling Tool Knowledge into Language Models via Back-Translated Traces

    Authors: Xingyue Huang, Xianglong Hu, Zifeng Ding, Yuan He, Rishabh, Waleed Alzarooni, Ziyu Ye, Wendong Fan, Bailan He, Haige Bo, Changran Hu, Guohao Li

    Abstract: Large language models (LLMs) often struggle with mathematical problems that require exact computation or multi-step algebraic reasoning. Tool-integrated reasoning (TIR) offers a promising solution by leveraging external tools such as code interpreters to ensure correctness, but it introduces inference-time dependencies that hinder scalability and deployment. In this work, we propose a new paradigm…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Accepted in Workshop in Multi-Agent Systems in the Era of Foundation Models: Opportunities, Challenges and Futures, ICML 2025

  29. arXiv:2506.19028  [pdf, ps, other]

    cs.CL cs.AI cs.CY

    Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective

    Authors: Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy

    Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework to evaluate group-level fairness in…

    Submitted 24 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: 29 pages, 9 figures, 15 tables

    MSC Class: 68T50 ACM Class: I.2.7

  30. arXiv:2506.18141  [pdf, ps, other]

    cs.CL cs.AI

    Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models

    Authors: Ruixuan Deng, Xiaoyang Hu, Miles Gilberti, Shane Storks, Aman Taxali, Mike Angstadt, Chandra Sripada, Joyce Chai

    Abstract: We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on country-relation tasks, we show that ablating semantic components for countries and relations changes model outputs in predictable ways, while amplifying these components induces counte…

    Submitted 22 June, 2025; originally announced June 2025.

  31. arXiv:2506.17369  [pdf, ps, other]

    cs.SE cs.AI

    Re-Evaluating Code LLM Benchmarks Under Semantic Mutation

    Authors: Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

    Abstract: In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related tasks, such as code understanding and generation. A critical step in constructing code benchmarks is the design of prompts. However, as existing code benchmarks typ…

    Submitted 20 June, 2025; originally announced June 2025.

  32. arXiv:2506.17311  [pdf, ps, other]

    cs.CY

    Can Large Language Models Be Trusted Paper Reviewers? A Feasibility Study

    Authors: Chuanlei Li, Xu Hu, Minghui Xu, Kun Li, Yue Zhang, Xiuzhen Cheng

    Abstract: Academic paper review typically requires substantial time, expertise, and human resources. Large Language Models (LLMs) present a promising method for automating the review process due to their extensive training data, broad knowledge base, and relatively low usage cost. This work explores the feasibility of using LLMs for academic paper review by proposing an automated review system. The system i…

    Submitted 18 June, 2025; originally announced June 2025.

  33. arXiv:2506.17101  [pdf, ps, other]

    cs.CV

    Multi-label Scene Classification for Autonomous Vehicles: Acquiring and Accumulating Knowledge from Diverse Datasets

    Authors: Ke Li, Chenyu Zhang, Yuxin Ding, Xianbiao Hu, Ruwen Qin

    Abstract: Driving scene identification, which assigns multiple non-exclusive class labels to a scene, provides the contextual awareness necessary for enhancing autonomous vehicles' ability to understand, reason about, and interact with the complex driving environment. As a multi-label classification problem, it is better tackled via multitasking learning. However, directly training a multi-label classificat…

    Submitted 23 June, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

  34. arXiv:2506.16701  [pdf, ps, other]

    cs.CV

    Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition

    Authors: Xiaodan Hu, Chuhang Zou, Suchen Wang, Jaechul Kim, Narendra Ahuja

    Abstract: Recent video action recognition methods have shown excellent performance by adapting large-scale pre-trained language-image models to the video domain. However, language models contain rich common sense priors - the scene contexts that humans use to constitute an understanding of objects, human-object interactions, and activities - that have not been fully exploited. In this paper, we introduce a…

    Submitted 19 June, 2025; originally announced June 2025.

  35. arXiv:2506.16402  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.RO

    IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

    Authors: Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

    Abstract: Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that…

    Submitted 19 June, 2025; originally announced June 2025.

  36. arXiv:2506.16307  [pdf, ps, other]

    cs.CV cs.AI eess.IV

    Learning Multi-scale Spatial-frequency Features for Image Denoising

    Authors: Xu Zhao, Chen Zhao, Xiantao Hu, Hongliang Zhang, Ying Tai, Jian Yang

    Abstract: Recent advancements in multi-scale architectures have demonstrated exceptional performance in image denoising tasks. However, existing architectures mainly depend on a fixed single-input single-output Unet architecture, ignoring the multi-scale representations of pixel level. In addition, previous methods treat the frequency domain uniformly, ignoring the different characteristics of high-frequen…

    Submitted 19 June, 2025; originally announced June 2025.

  37. arXiv:2506.15864  [pdf, ps, other]

    cs.LG

    Improving Rectified Flow with Boundary Conditions

    Authors: Xixi Hu, Runlong Liao, Keyang Xu, Bo Liu, Yeqing Li, Eugene Ie, Hongliang Fei, Qiang Liu

    Abstract: Rectified Flow offers a simple and effective approach to high-quality generative modeling by learning a velocity field. However, we identify a limitation in directly modeling the velocity with an unconstrained neural network: the learned velocity often fails to satisfy certain boundary conditions, leading to inaccurate velocity field estimations that deviate from the desired ODE. This issue is par…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 14 pages

  38. arXiv:2506.15835  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    MoNetV2: Enhanced Motion Network for Freehand 3D Ultrasound Reconstruction

    Authors: Mingyuan Luo, Xin Yang, Zhongnuo Yan, Yan Cao, Yuanji Zhang, Xindi Hu, Jin Wang, Haoxuan Ding, Wei Han, Litao Sun, Dong Ni

    Abstract: Three-dimensional (3D) ultrasound (US) aims to provide sonographers with the spatial relationships of anatomical structures, playing a crucial role in clinical diagnosis. Recently, deep-learning-based freehand 3D US has made significant advancements. It reconstructs volumes by estimating transformations between images without external tracking. However, image-only reconstruction poses difficulties…

    Submitted 16 June, 2025; originally announced June 2025.

  39. arXiv:2506.15160  [pdf, ps, other]

    cs.CV

    Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation

    Authors: Jiaqi Shi, Jin Xiao, Xiaoguang Hu, Boyang Song, Hao Jiang, Tianyou Chen, Baochang Zhang

    Abstract: Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, there are irrelevant point interference and feature hierarchy gap problems due to the limitation of local coordinates. Although some works address this limita…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 17 pages, 7 figures

  40. arXiv:2506.13651  [pdf, ps, other]

    cs.LG

    xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations

    Authors: Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, et al. (8 additional authors not shown)

    Abstract: We introduce xbench, a dynamic, profession-aligned evaluation suite designed to bridge the gap between AI agent capabilities and real-world productivity. While existing benchmarks often focus on isolated technical skills, they may not accurately reflect the economic value agents deliver in professional settings. To address this, xbench targets commercially significant domains with evaluation tasks…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Project page: https://xbench.org

  41. arXiv:2506.12481  [pdf, ps, other]

    cs.CV cs.LG cs.SD eess.AS

    Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation

    Authors: Runhao Zeng, Qi Deng, Ronghao Zhang, Shuaicheng Niu, Jian Chen, Xiping Hu, Victor C. M. Leung

    Abstract: Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio informatio…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: 14 pages, 7 figures

  42. arXiv:2506.11452  [pdf, ps, other]

    cs.IR

    Leveraging Reference Documents for Zero-Shot Ranking via Large Language Models

    Authors: Jieran Li, Xiuyuan Hu, Yang Zhao, Shengyao Zhuang, Hao Zhang

    Abstract: Large Language Models (LLMs) have demonstrated exceptional performance in the task of text ranking for information retrieval. While Pointwise ranking approaches offer computational efficiency by scoring documents independently, they often yield biased relevance estimates due to the lack of inter-document comparisons. In contrast, Pairwise methods improve ranking accuracy by explicitly comparing do…

    Submitted 13 June, 2025; originally announced June 2025.

  43. arXiv:2506.10558  [pdf, ps, other]

    cs.LO cs.AI

    StepProof: Step-by-step verification of natural language mathematical proofs

    Authors: Xiaolin Hu, Qinghua Zhou, Bogdan Grechuk, Ivan Y. Tyukin

    Abstract: Interactive theorem provers (ITPs) are powerful tools for the formal verification of mathematical proofs down to the axiom level. However, their lack of a natural language interface remains a significant limitation. Recent advancements in large language models (LLMs) have enhanced the understanding of natural language inputs, paving the way for autoformalization - the process of translating natura…

    Submitted 30 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

  44. arXiv:2506.10507  [pdf, ps, other]

    cs.GR cs.CV

    Edit360: 2D Image Edits to 3D Assets from Any Angle

    Authors: Junchao Huang, Xinting Hu, Shaoshuai Shi, Zhuotao Tian, Li Jiang

    Abstract: Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tun…

    Submitted 30 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

    Comments: 11 pages, 9 figures

  45. arXiv:2506.10426  [pdf, ps, other]

    cs.SE

    Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models

    Authors: Xiao Yu, Haoxuan Chen, Feifei Niu, Xing Hu, Jacky Wai Keung, Xin Xia

    Abstract: With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing complexity of these frameworks brings non-trivial software bugs, which may degrade training performance, cause unexpected failures, and result in significant res…

    Submitted 12 June, 2025; originally announced June 2025.

  46. arXiv:2506.10399  [pdf, ps, other]

    cs.CR

    FicGCN: Unveiling the Homomorphic Encryption Efficiency from Irregular Graph Convolutional Networks

    Authors: Zhaoxuan Kan, Husheng Han, Shangyi Shi, Tenghui Hua, Hang Lu, Xiaowei Li, Jianan Mu, Xing Hu

    Abstract: Graph Convolutional Neural Networks (GCNs) have gained widespread popularity in various fields like personal healthcare and financial systems, due to their remarkable performance. Despite the growing demand for cloud-based GCN services, privacy concerns over sensitive graph data remain significant. Homomorphic Encryption (HE) facilitates Privacy-Preserving Machine Learning (PPML) by allowing compu…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Accepted by ICML 2025

  47. arXiv:2506.10127  [pdf, ps, other]

    cs.LG

    Meet Me at the Arm: The Cooperative Multi-Armed Bandits Problem with Shareable Arms

    Authors: Xinyi Hu, Aldo Pacchiano

    Abstract: We study the decentralized multi-player multi-armed bandits (MMAB) problem under a no-sensing setting, where each player receives only their own reward and obtains no information about collisions. Each arm has an unknown capacity, and if the number of players pulling an arm exceeds its capacity, all players involved receive zero reward. This setting generalizes the classical unit-capacity model an…

    Submitted 11 June, 2025; originally announced June 2025.

  48. arXiv:2506.09501  [pdf, ps, other]

    cs.CL

    Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

    Authors: Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu

    Abstract: Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant di…

    Submitted 11 June, 2025; originally announced June 2025.

  49. arXiv:2506.09345  [pdf, ps, other]

    cs.CV

    An Effective End-to-End Solution for Multimodal Action Recognition

    Authors: Songping Wang, Xiantao Hu, Yueming Lyu, Caifeng Shan

    Abstract: Recently, multimodal tasks have strongly advanced the field of action recognition with their rich multimodal information. However, due to the scarcity of tri-modal data, research on tri-modal action recognition tasks faces many challenges. To this end, we have proposed a comprehensive multimodal action recognition solution that effectively utilizes multimodal information. First, the existing data…

    Submitted 10 June, 2025; originally announced June 2025.

  50. arXiv:2506.07492  [pdf, ps, other]

    cs.LG stat.ML

    Explicit Preference Optimization: No Need for an Implicit Reward Model

    Authors: Xiangkun Hu, Lemin Kong, Tong He, David Wipf

    Abstract: The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research effort has targeted more straightforward alternatives.…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2407.09072