Showing 1–50 of 427 results for author: Xiao, H

  1. arXiv:2507.04591  [pdf]

    physics.med-ph cs.CV eess.IV

    Emerging Frameworks for Objective Task-based Evaluation of Quantitative Medical Imaging Methods

    Authors: Yan Liu, Huitian Xia, Nancy A. Obuchowski, Richard Laforest, Arman Rahmim, Barry A. Siegel, Abhinav K. Jha

    Abstract: Quantitative imaging (QI) is demonstrating strong promise across multiple clinical applications. For clinical translation of QI methods, objective evaluation on clinically relevant tasks is essential. To address this need, multiple evaluation strategies are being developed. In this paper, based on previous literature, we outline four emerging frameworks to perform evaluation studies of QI methods.…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: 19 pages, 7 figures

  2. arXiv:2507.04306  [pdf, ps, other]

    cs.CV

    Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis

    Authors: Zhipeng Li, Kegang Wang, Hanguang Xiao, Xingyue Liu, Feizhong Zhou, Jiaxin Jiang, Tianqi Liu

    Abstract: Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting co…

    Submitted 6 July, 2025; originally announced July 2025.

  3. arXiv:2507.03840  [pdf, ps, other]

    cs.LG cond-mat.mtrl-sci cs.DC physics.comp-ph

    Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction

    Authors: Manasa Kaniselvan, Alexander Maeder, Chen Hao Xia, Alexandros Nikolaos Ziogas, Mathieu Luisier

    Abstract: Equivariant Graph Neural Networks (eGNNs) trained on density-functional theory (DFT) data can potentially perform electronic structure prediction at unprecedented scales, enabling investigation of the electronic properties of materials with extended defects, interfaces, or exhibiting disordered phases. However, as interactions between atomic orbitals typically extend over 10+ angstroms, the graph…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: 13 pages, 8 figures

  4. arXiv:2507.02827  [pdf, ps, other]

    cs.CV cs.AI

    USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network

    Authors: Ying Yu, Hang Xiao, Siyao Li, Jiarui Li, Haotian Tang, Hanyu Liu, Chao Li

    Abstract: The primary objective of human activity recognition (HAR) is to infer ongoing human actions from sensor data, a task that finds broad applications in health monitoring, safety protection, and sports analysis. Despite proliferating research, HAR still faces key challenges, including the scarcity of labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal m…

    Submitted 3 July, 2025; originally announced July 2025.

  5. arXiv:2507.02826  [pdf, ps, other]

    cs.CV

    Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach

    Authors: Panpan Ji, Junni Song, Hang Xiao, Hanyu Liu, Chao Li

    Abstract: Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path…

    Submitted 4 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

  6. Synergizing Implicit and Explicit User Interests: A Multi-Embedding Retrieval Framework at Pinterest

    Authors: Zhibo Fan, Hongtao Lin, Haoyu Chen, Bowen Deng, Hedi Xia, Yuke Yan, James Li

    Abstract: Industrial recommendation systems are typically composed of multiple stages, including retrieval, ranking, and blending. The retrieval stage plays a critical role in generating a high-recall set of candidate items that covers a wide range of diverse user interests. Effectively covering the diverse and long-tail user interests within this stage poses a significant challenge: traditional two-tower m…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: KDD 2025

  7. arXiv:2506.18902  [pdf, ps, other]

    cs.AI cs.CL cs.IR

    jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

    Authors: Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao

    Abstract: We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-docum…

    Submitted 7 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

    Comments: 22 pages, 1-10 main, 14-22 experimental results, benchmark tables

    MSC Class: 68T50; ACM Class: I.2.7

  8. arXiv:2506.16445  [pdf, ps, other]

    cs.CL cs.AI

    StoryWriter: A Multi-Agent Framework for Long Story Generation

    Authors: Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li

    Abstract: Long story generation remains a challenge for existing large language models (LLMs), primarily due to two main factors: (1) discourse coherence, which requires plot consistency, logical coherence, and completeness in the long-form generation, and (2) narrative complexity, which requires an interwoven and engaging narrative. To address these challenges, we propose StoryWriter, a multi-agent story g…

    Submitted 19 June, 2025; originally announced June 2025.

  9. arXiv:2506.15898  [pdf, ps, other]

    cs.LG

    TrajDiff: Diffusion Bridge Network with Semantic Alignment for Trajectory Similarity Computation

    Authors: Xiao Zhang, Xingyu Zhao, Hong Xia, Yuan Cao, Guiyuan Jiang, Junyu Dong, Yanwei Yu

    Abstract: With the proliferation of location-tracking technologies, massive volumes of trajectory data are continuously being collected. As a fundamental task in trajectory data mining, trajectory similarity computation plays a critical role in a wide range of real-world applications. However, existing learning-based methods face three challenges: First, they ignore the semantic gap between GPS and grid fea…

    Submitted 18 June, 2025; originally announced June 2025.

  10. arXiv:2506.15153  [pdf, ps, other]

    cs.CV

    SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts

    Authors: Yufei Liu, Haoke Xiao, Jiaxing Chai, Yongcun Zhang, Rong Wang, Zijie Meng, Zhiming Luo

    Abstract: The advent of Large Vision Models (LVMs) offers new opportunities for few-shot medical image segmentation. However, existing training-free methods based on LVMs fail to effectively utilize negative prompts, leading to poor performance on low-contrast medical images. To address this issue, we propose SynPo, a training-free few-shot method based on LVMs (e.g., SAM), with the core insight: improving…

    Submitted 19 June, 2025; v1 submitted 18 June, 2025; originally announced June 2025.

    Comments: MICCAI 2025 Early Accept. Project Page: https://liu-yufei.github.io/synpo-project-page/

  11. arXiv:2506.02452  [pdf, other]

    cs.CV

    ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model

    Authors: Wenshuo Chen, Kuimou Yu, Haozhe Jia, Kaishen Yuan, Bowen Tian, Songning Lai, Hongru Xiao, Erhang Zhang, Lei Wang, Yutao Yue

    Abstract: While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulatio…

    Submitted 3 June, 2025; originally announced June 2025.

  12. arXiv:2505.23932  [pdf, ps, other]

    cs.CL

    SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving

    Authors: Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Z. Morley Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong

    Abstract: We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integrati…

    Submitted 2 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

  13. arXiv:2505.22831  [pdf, ps, other]

    cs.HC cs.AI

    Orca: Browsing at Scale Through User-Driven and AI-Facilitated Orchestration Across Malleable Webpages

    Authors: Peiling Jiang, Haijun Xia

    Abstract: Web-based activities are fundamentally distributed across webpages. However, conventional browsers with stacks of tabs fail to support operating and synthesizing large volumes of information across pages. While recent AI systems enable fully automated web browsing and information synthesis, they often diminish user agency and hinder contextual understanding. Therefore, we explore how AI could inst…

    Submitted 28 May, 2025; originally announced May 2025.

  14. arXiv:2505.21496  [pdf, ps, other]

    cs.CL cs.CV cs.LG

    UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

    Authors: Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li

    Abstract: In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently p…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: https://github.com/Euphoria16/UI-Genie

  15. arXiv:2505.20246  [pdf, ps, other]

    cs.AI cs.CL

    On Path to Multimodal Historical Reasoning: HistBench and HistAgent

    Authors: Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Shu Zhang, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, et al. (74 additional authors not shown)

    Abstract: Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks,…

    Submitted 19 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: 17 pages, 7 figures

  16. arXiv:2505.19490  [pdf, other]

    cs.AI

    Automated CAD Modeling Sequence Generation from Text Descriptions via Transformer-Based Large Language Models

    Authors: Jianxing Liao, Junyan Xu, Yatao Sun, Maowen Tang, Sicheng He, Jingxian Liao, Shui Yu, Yun Li, Hongguan Xiao

    Abstract: Designing complex computer-aided design (CAD) models is often time-consuming due to challenges such as computational inefficiency and the difficulty of generating precise models. We propose a novel language-guided framework for industrial design automation to address these issues, integrating large language models (LLMs) with computer-automated design (CAutoD). Through this framework, CAD models ar…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted by ACL 2025 Main Conference

    ACM Class: I.2.7; I.2.6

  17. arXiv:2505.19073  [pdf, ps, other]

    cs.CL

    Towards Harmonized Uncertainty Estimation for Large Language Models

    Authors: Rui Li, Jing Long, Muge Qi, Heming Xia, Lei Sha, Peiyi Wang, Zhifang Sui

    Abstract: To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis highlights the pitfalls of these methods to st…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: ACL 2025

  18. arXiv:2505.18994  [pdf, ps, other]

    cs.RO

    Designing Pin-pression Gripper and Learning its Dexterous Grasping with Online In-hand Adjustment

    Authors: Hewen Xiao, Xiuping Liu, Hang Zhao, Jian Liu, Kai Xu

    Abstract: We introduce a novel design of parallel-jaw grippers drawing inspiration from pin-pression toys. The proposed pin-pression gripper features a distinctive mechanism in which each finger integrates a 2D array of pins capable of independent extension and retraction. This unique design allows the gripper to instantaneously customize its finger's shape to conform to the object being grasped by dynamica…

    Submitted 25 May, 2025; originally announced May 2025.

  19. arXiv:2505.18628  [pdf, ps, other]

    cs.IT

    Multi-Subarray FD-RIS Enhanced Multi-user Wireless Networks: With Joint Distance-Angle Beamforming

    Authors: Han Xiao, Xiaoyan Hu, Wenjie Wang, Kai-Kit Wong, Kun Yang, Shi Jin

    Abstract: The concept of the frequency diverse reconfigurable intelligent surface (FD-RIS) technology has been introduced, which can enable simultaneous implementation of distance-angle beamforming in far-field communication scenarios. In order to improve the managing ability on undesired harmonic signals and the diversity of frequency offsets, this paper presents a novel multi-subarray FD-RIS framework. In…

    Submitted 24 May, 2025; originally announced May 2025.

  20. arXiv:2505.17951  [pdf, ps, other]

    cs.CV

    SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes

    Authors: Haihong Xiao, Jianan Zou, Yuxin Zhou, Ying He, Wenxiong Kang

    Abstract: We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor environments. SplatCo builds upon two novel components: (1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features that represent fine surface details. This fusion is achie…

    Submitted 23 May, 2025; originally announced May 2025.

  21. arXiv:2505.17917  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    M-learner:A Flexible And Powerful Framework To Study Heterogeneous Treatment Effect In Mediation Model

    Authors: Xingyu Li, Qing Liu, Tony Jiang, Hong Amy Xia, Brian P. Hobbs, Peng Wei

    Abstract: We propose a novel method, termed the M-learner, for estimating heterogeneous indirect and total treatment effects and identifying relevant subgroups within a mediation framework. The procedure comprises four key steps. First, we compute individual-level conditional average indirect/total treatment effect. Second, we construct a distance matrix based on pairwise differences. Third, we apply tSNE to…

    Submitted 30 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

  22. arXiv:2505.16782  [pdf, ps, other]

    cs.CL

    Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning

    Authors: Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, Xiaoyu Shen

    Abstract: Large Language Models (LLMs) have achieved impressive performance on complex reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional CoT relies on reasoning steps explicitly verbalized in natural language, introducing inefficiencies and limiting its applicability to abstract reasoning. To address this, there has been growing research interest in latent CoT reasoning, where inf…

    Submitted 22 May, 2025; originally announced May 2025.

  23. arXiv:2505.16162  [pdf, other]

    cs.CL

    KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization

    Authors: Mingbo Song, Heming Xia, Jun Zhang, Chak Tou Leong, Qiancheng Xu, Wenjie Li, Sujian Li

    Abstract: Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which el…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 8 pages

  24. arXiv:2505.15118  [pdf, other]

    cs.SI cs.DB

    Maximum Degree-Based Quasi-Clique Search via an Iterative Framework

    Authors: Hongbo Xia, Kaiqiang Yu, Shengxin Liu, Cheng Long, Xun Zhou

    Abstract: Cohesive subgraph mining is a fundamental problem in graph theory with numerous real-world applications, such as social network analysis and protein-protein interaction modeling. Among various cohesive subgraphs, the $γ$-quasi-clique is widely studied for its flexibility in requiring each vertex to connect to at least a $γ$ proportion of other vertices in the subgraph. However, solving the maximum…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Appears in the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2025

  25. arXiv:2505.14062  [pdf, other]

    cs.CV

    Scaling Vision Mamba Across Resolutions via Fractal Traversal

    Authors: Bo Li, Haoke Xiao, Lv Tang

    Abstract: Vision Mamba has recently emerged as a promising alternative to Transformer-based architectures, offering linear complexity in sequence length while maintaining strong modeling capacity. However, its adaptation to visual inputs is hindered by challenges in 2D-to-1D patch serialization and weak scalability across input resolutions. Existing serialization strategies such as raster scanning disrupt l…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Work in progress

  26. arXiv:2505.12814  [pdf, ps, other]

    cs.CL cs.AI

    PsyMem: Fine-grained psychological alignment and Explicit Memory Control for Advanced Role-Playing LLMs

    Authors: Xilong Cheng, Yunxiao Qin, Yuting Tan, Zhengnan Li, Ye Wang, Hongjiang Xiao, Yuan Zhang

    Abstract: Existing LLM-based role-playing methods often rely on superficial textual descriptions or simplistic metrics, inadequately modeling both intrinsic and extrinsic character dimensions. Additionally, they typically simulate character memory with implicit model knowledge or basic retrieval-augmented generation without explicit memory alignment, compromising memory consistency. The two issues weaken reli…

    Submitted 19 May, 2025; originally announced May 2025.

  27. arXiv:2505.12402  [pdf, ps, other]

    cs.CR

    Automated Profile Inference with Language Model Agents

    Authors: Yuntao Du, Zitao Li, Bolin Ding, Yaliang Li, Hanshen Xiao, Jingren Zhou, Ninghui Li

    Abstract: Impressive progress has been made in automated problem-solving by the collaboration of large language models (LLMs) based agents. However, these automated capabilities also open avenues for malicious applications. In this paper, we study a new threat that LLMs pose to online pseudonymity, called automated profile inference, where an adversary can instruct LLMs to automatically scrape and extract s…

    Submitted 18 May, 2025; originally announced May 2025.

  28. arXiv:2505.11248  [pdf, ps, other]

    eess.SP cs.IT

    Unfolded Deep Graph Learning for Networked Over-the-Air Computation

    Authors: Xiao Tang, Huirong Xiao, Chao Shen, Li Sun, Qinghe Du, Dusit Niyato, Zhu Han

    Abstract: Over-the-air computation (AirComp) has emerged as a promising technology that enables simultaneous transmission and computation through wireless channels. In this paper, we investigate the networked AirComp in multiple clusters allowing diversified data computation, which is yet challenged by the transceiver coordination and interference management therein. Particularly, we aim to maximize the mul…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: Accepted @ IEEE TWC

  29. arXiv:2505.10940  [pdf, ps, other]

    cs.IR cs.AI

    Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation

    Authors: Qing Yu, Xiaobei Wang, Shuchang Liu, Yandong Bai, Xiaoyu Yang, Xueliang Wang, Chang Meng, Shanshan Wu, Hailan Yang, Huihui Xiao, Xiang Li, Fan Yang, Xiaoqiang Feng, Lantao Hu, Han Li, Kun Gai, Lixin Zou

    Abstract: Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of use…

    Submitted 20 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  30. arXiv:2505.10787  [pdf, ps, other]

    cs.CV

    EA-3DGS: Efficient and Adaptive 3D Gaussians with Highly Enhanced Quality for outdoor scenes

    Authors: Jianlin Guo, Haihong Xiao, Wenxiong Kang

    Abstract: Efficient scene representations are essential for many real-world applications, especially those involving spatial measurement. Although current NeRF-based methods have achieved impressive results in reconstructing building-scale scenes, they still suffer from slow training and inference speeds due to time-consuming stochastic sampling. Recently, 3D Gaussian Splatting (3DGS) has demonstrated excel…

    Submitted 15 May, 2025; originally announced May 2025.

  31. arXiv:2505.10557  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

    Authors: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li

    Abstract: Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inhe…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL 2025 Findings

  32. arXiv:2505.08013  [pdf, ps, other]

    cs.CV

    RDD: Robust Feature Detector and Descriptor using Deformable Transformer

    Authors: Gonglin Chen, Tianwen Fu, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, Yajie Zhao

    Abstract: As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Ro…

    Submitted 19 June, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

  33. arXiv:2505.05813  [pdf, ps, other]

    cs.LG

    BCE vs. CE in Deep Feature Learning

    Authors: Qiufu Li, Huibin Xiao, Linlin Shen

    Abstract: When training classification models, one expects that the learned features are compact within classes, and can well separate different classes. As the dominant loss function for training classification models, minimizing cross-entropy (CE) loss maximizes the compactness and distinctiveness, i.e., reaching neural collapse (NC). The recent works show that binary CE (BCE) also performs well in multi-c…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML2025

  34. arXiv:2505.05446  [pdf, ps, other]

    cs.CV cs.CL

    Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

    Authors: Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li

    Abstract: Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextu…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: CVPR2025

  35. arXiv:2505.03733  [pdf, other]

    cs.CL

    WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

    Authors: Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li

    Abstract: LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. The…

    Submitted 6 May, 2025; originally announced May 2025.

  36. arXiv:2504.19838  [pdf, other]

    cs.HC

    LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

    Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li

    Abstract: With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and s…

    Submitted 23 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

    Comments: 39 pages, 10 figures, 7 tables, Project Homepage: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents

  37. arXiv:2504.19362  [pdf, other]

    eess.IV cs.AI cs.CV

    Low-Rank Adaptive Structural Priors for Generalizable Diabetic Retinopathy Grading

    Authors: Yunxuan Wang, Ray Yin, Yumei Tan, Hao Chen, Haiying Xia

    Abstract: Diabetic retinopathy (DR), a serious ocular complication of diabetes, is one of the primary causes of vision loss among retinal vascular diseases. Deep learning methods have been extensively applied in the grading of DR. However, their performance declines significantly when applied to data outside the training distribution due to domain shifts. Domain generalization (DG) ha…

    Submitted 27 April, 2025; originally announced April 2025.

    Comments: Accepted by IJCNN 2025

  38. arXiv:2504.15278  [pdf, other]

    cs.CV cs.RO

    DRAWER: Digital Reconstruction and Articulation With Environment Realism

    Authors: Hongchi Xia, Entong Su, Marius Memmel, Arhan Jain, Raymond Yu, Numfor Mbiziwo-Tiapo, Ali Farhadi, Abhishek Gupta, Shenlong Wang, Wei-Chiu Ma

    Abstract: Creating virtual digital replicas from real-world data unlocks significant potential across domains like gaming and robotics. In this paper, we present DRAWER, a novel framework that converts a video of a static indoor scene into a photorealistic and interactive digital environment. Our approach centers on two main contributions: (i) a reconstruction module based on a dual scene representation tha…

    Submitted 22 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: Project page: https://drawer-art.github.io/

  39. arXiv:2504.14482  [pdf, other]

    cs.CL cs.SD

    DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

    Authors: Xiang Li, Duyi Pan, Hongru Xiao, Jiale Han, Jing Tang, Jiabao Ma, Wei Wang, Bo Cheng

    Abstract: Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which…

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: Accepted by ICME 2025. Dataset and code are publicly available: https://github.com/uirlx/DialogueAgents

  40. arXiv:2504.11784  [pdf, other]

    cs.IT

    DALC: Distributed Arithmetic Coding Aided by Linear Codes

    Authors: Junwei Zhou, HaoYun Xiao, Jianwen Xi, Qiuzhen Lin

    Abstract: Distributed Arithmetic Coding (DAC) has emerged as a feasible solution to the Slepian-Wolf problem, particularly in scenarios with non-stationary sources and for data sequences with lengths ranging from small to medium. Due to the inherent decoding ambiguity in DAC, the number of candidate paths grows exponentially with the increase in source length. To select the correct decoding path from the se…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: 7 pages, 7 figures

  41. arXiv:2504.08344  [pdf, other]

    cs.CV

    EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model

    Authors: Renda Li, Xiaohua Qi, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei Han, Jing Xiao

    Abstract: Audio-driven cospeech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. In order to improve the generation effect, previous works adopted complex input and training strategies and required a l…

    Submitted 11 April, 2025; originally announced April 2025.

  42. arXiv:2504.07389  [pdf, other]

    cs.LG cs.AI cs.CL

    Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression

    Authors: Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantiz…

    Submitted 9 April, 2025; originally announced April 2025.

    Comments: 24 pages. Code: https://github.com/The-Inscrutable-X/TACQ

  43. arXiv:2504.04766  [pdf, other]

    cs.LG cs.AI

    KunPeng: A Global Ocean Environmental Model

    Authors: Yi Zhao, Jiaqi Li, Haitao Xia, Tianjiao Zhang, Zerong Zeng, Tianyu Ren, Yucheng Zhang, Chao Zhu, Shengtong Xu, Hongchun Yuan

    Abstract: Inspired by the similarity of the atmosphere-ocean physical coupling mechanism, this study innovatively migrates meteorological large-model techniques to the ocean domain, constructing the KunPeng global ocean environmental prediction model. Aimed at the discontinuous characteristics of marine space, we propose a terrain-adaptive mask constraint mechanism to mitigate effectively training divergenc…

    Submitted 7 April, 2025; originally announced April 2025.

  44. arXiv:2504.04472  [pdf, other]

    cs.SI

    Fast Maximization of Current Flow Group Closeness Centrality

    Authors: Haisong Xia, Zhongzhi Zhang

    Abstract: Derived from effective resistances, the current flow closeness centrality (CFCC) for a group of nodes measures the importance of node groups in an undirected graph with $n$ nodes. Given the widespread applications of identifying crucial nodes, we investigate the problem of maximizing CFCC for a node group $S$ subject to the cardinality constraint $|S|=k\ll n$. Despite the proven NP-hardness of thi…

    Submitted 6 April, 2025; originally announced April 2025.

  45. arXiv:2504.03687  [pdf, other]

    eess.SP cs.AI cs.CV

    Process Optimization and Deployment for Sensor-Based Human Activity Recognition Based on Deep Learning

    Authors: Hanyu Liu, Ying Yu, Hang Xiao, Siyao Li, Xuze Li, Jiarui Li, Haotian Tang

    Abstract: Sensor-based human activity recognition is a key technology for many human-centered intelligent applications. However, this research is still in its infancy and faces many unresolved challenges. To address these, we propose a comprehensive optimization process approach centered on multi-attention interaction. We first utilize unsupervised statistical feature-guided diffusion models for highly adap…

    Submitted 22 March, 2025; originally announced April 2025.

  46. arXiv:2504.01822  [pdf, other]

    cs.SE cs.CR

    Track and Trace: Automatically Uncovering Cross-chain Transactions in the Multi-blockchain Ecosystems

    Authors: Dan Lin, Ziye Zheng, Jiajing Wu, Jingjing Yang, Kaixin Lin, Huan Xiao, Bowen Song, Zibin Zheng

    Abstract: Cross-chain technology enables seamless asset transfer and message-passing within decentralized finance (DeFi) ecosystems, facilitating multi-chain coexistence in the current blockchain environment. However, this development also raises security concerns, as malicious actors exploit cross-chain asset flows to conceal the provenance and destination of assets, thereby facilitating illegal activities…

    Submitted 2 April, 2025; originally announced April 2025.

  47. arXiv:2503.21843  [pdf, ps, other]

    cs.CV cs.AI

    CMD-HAR: Cross-Modal Disentanglement for Wearable Human Activity Recognition

    Authors: Hanyu Liu, Siyao Li, Ying Yu, Yixuan Jiang, Hang Xiao, Jingxi Long, Haotian Tang, Chao Li

    Abstract: Human Activity Recognition (HAR) is a fundamental technology for numerous human-centered intelligent applications. Although deep learning methods have been utilized to accelerate feature extraction, issues such as multimodal data mixing, activity heterogeneity, and complex model deployment remain largely unresolved. The aim of this paper is to address issues such as multimodal data mixing, activ…

    Submitted 4 July, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

  48. arXiv:2503.21620  [pdf, other]

    cs.AI

    UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

    Authors: Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, Hongsheng Li

    Abstract: The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphic user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL ca…

    Submitted 24 May, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

    Comments: Updated UI-R1-E-3B

  49. arXiv:2503.20682  [pdf, other]

    cs.CV

    GLRD: Global-Local Collaborative Reason and Debate with PSL for 3D Open-Vocabulary Detection

    Authors: Xingyu Peng, Si Liu, Chen Gao, Yan Bai, Beipeng Mu, Xiaofei Wang, Huaxia Xia

    Abstract: The task of LiDAR-based 3D Open-Vocabulary Detection (3D OVD) requires the detector to learn to detect novel objects from point clouds without off-the-shelf training labels. Previous methods focus on the learning of object-level representations and ignore the scene-level information, thus it is hard to distinguish objects with similar classes. In this work, we propose a Global-Local Collaborative…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: 15 pages

  50. arXiv:2503.15015  [pdf, other]

    cs.CR

    OFL: Opportunistic Federated Learning for Resource-Heterogeneous and Privacy-Aware Devices

    Authors: Yunlong Mao, Mingyang Niu, Ziqin Dang, Chengxi Li, Hanning Xia, Yuejuan Zhu, Haoyu Bian, Yuan Zhang, Jingyu Hua, Sheng Zhong

    Abstract: Efficient and secure federated learning (FL) is a critical challenge for resource-limited devices, especially mobile devices. Existing secure FL solutions commonly incur significant overhead, leading to a contradiction between efficiency and security. As a result, these two concerns are typically addressed separately. This paper proposes Opportunistic Federated Learning (OFL), a novel FL framework…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: 14 pages, 13 figures