Showing 1–50 of 1,058 results for author: Zha, L

Searching in archive cs.
  1. arXiv:2507.05255  [pdf, ps, other]

    cs.CV cs.CL

    Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

    Authors: Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel

    Abstract: The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimoda…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2507.04857  [pdf, ps, other]

    cs.SE

    Supporting Software Formal Verification with Large Language Models: An Experimental Study

    Authors: Weiqi Wang, Marie Farrell, Lucas C. Cordeiro, Liping Zhao

    Abstract: Formal methods have been employed for requirements verification for a long time. However, it is difficult to automatically derive properties from natural language requirements. SpecVerify addresses this challenge by integrating large language models (LLMs) with formal verification tools, providing a more flexible mechanism for expressing requirements. This framework combines Claude 3.5 Sonnet with…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: Accepted for publication in 2025 IEEE 33rd International Requirements Engineering Conference (RE)

  3. arXiv:2507.03920  [pdf, ps, other]

    cs.LG physics.chem-ph

    Combining Graph Neural Networks and Mixed Integer Linear Programming for Molecular Inference under the Two-Layered Model

    Authors: Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

    Abstract: Recently, a novel two-phase framework named mol-infer for inference of chemical compounds with prescribed abstract structures and desired property values has been proposed. The framework mol-infer is primarily based on using mixed integer linear programming (MILP) to simulate the computational process of machine learning methods and describe the necessary and sufficient conditions to ensure such a…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2107.02381, arXiv:2109.02628

  4. arXiv:2507.03917  [pdf, ps, other]

    cs.LG cs.CV

    Consistency-Aware Padding for Incomplete Multi-Modal Alignment Clustering Based on Self-Repellent Greedy Anchor Search

    Authors: Shubin Ma, Liang Zhao, Mingdong Lu, Yifan Guo, Bo Xu

    Abstract: Multimodal representation is faithful and highly effective in describing real-world data samples' characteristics by describing their complementary information. However, the collected data often exhibits incomplete and misaligned characteristics due to factors such as inconsistent sensor frequencies and device malfunctions. Existing research has not effectively addressed the issue of filling missi…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Accepted at IJCAI 2025. 9 pages, 3 figures

    ACM Class: I.2.6; I.5.3

  5. arXiv:2507.03542  [pdf, ps, other]

    cs.CV

    Beyond Accuracy: Metrics that Uncover What Makes a `Good' Visual Descriptor

    Authors: Ethan Lin, Linxi Zhao, Atharva Sehgal, Jennifer J. Sun

    Abstract: Text-based visual descriptors-ranging from simple class names to more descriptive phrases-are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representati…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: VisCon @ CVPR 2025

  6. arXiv:2507.03298  [pdf, ps, other]

    cs.LG

    Dyn-O: Building Structured World Models with Object-Centric Representations

    Authors: Zizhao Wang, Kaixin Wang, Li Zhao, Peter Stone, Jiang Bian

    Abstract: World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object-centric rather than monolithic representations, with the goal of more effectively capturing envi…

    Submitted 4 July, 2025; originally announced July 2025.

  7. arXiv:2507.02705  [pdf, ps, other]

    cs.CV

    SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

    Authors: Qi Xu, Dongxu Wei, Lingzhe Zhao, Wenpu Li, Zhangchi Huang, Shunping Ji, Peidong Liu

    Abstract: Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultane…

    Submitted 3 July, 2025; originally announced July 2025.

  8. arXiv:2507.02379  [pdf]

    cs.AI q-bio.BM

    An AI-native experimental laboratory for autonomous biomolecular engineering

    Authors: Mingyu Wu, Zhaoguo Wang, Jiabin Wang, Zhiyuan Dong, Jingkai Yang, Qingting Li, Tianyu Huang, Lei Zhao, Mingqiang Li, Fei Wang, Chunhai Fan, Haibo Chen

    Abstract: Autonomous scientific research, capable of independently conducting complex experiments and serving non-specialists, represents a long-held aspiration. Achieving it requires a fundamental paradigm shift driven by artificial intelligence (AI). While autonomous experimental systems are emerging, they remain confined to areas featuring singular objectives and well-defined, simple experimental workflo…

    Submitted 3 July, 2025; originally announced July 2025.

  9. arXiv:2507.01342  [pdf, ps, other]

    cs.CV

    Learning Camera-Agnostic White-Balance Preferences

    Authors: Luxi Zhao, Mahmoud Afifi, Michael S. Brown

    Abstract: The image signal processor (ISP) pipeline in modern cameras consists of several modules that transform raw sensor data into visually pleasing images in a display color space. Among these, the auto white balance (AWB) module is essential for compensating for scene illumination. However, commercial AWB systems often strive to compute aesthetic white-balance preferences rather than accurate neutral c…

    Submitted 2 July, 2025; originally announced July 2025.

  10. arXiv:2507.00586  [pdf, ps, other]

    cs.CV

    Context-Aware Academic Emotion Dataset and Benchmark

    Authors: Luming Zhao, Jingwen Xuan, Jiamin Lou, Yonghui Yu, Wenwu Yang

    Abstract: Academic emotion analysis plays a crucial role in evaluating students' engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to ICCV 2025

  11. arXiv:2506.18922  [pdf, ps, other]

    cs.CV cs.RO

    Correspondence-Free Multiview Point Cloud Registration via Depth-Guided Joint Optimisation

    Authors: Yiran Zhou, Yingyu Wang, Shoudong Huang, Liang Zhao

    Abstract: Multiview point cloud registration is a fundamental task for constructing globally consistent 3D models. Existing approaches typically rely on feature extraction and data association across multiple point clouds; however, these processes are challenging to obtain global optimal solution in complex environments. In this paper, we introduce a novel correspondence-free multiview point cloud registrat…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 8 pages, accepted for publication in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

  12. arXiv:2506.18256  [pdf, ps, other]

    cs.RO

    Robot Tactile Gesture Recognition Based on Full-body Modular E-skin

    Authors: Shuo Jiang, Boce Hu, Linfeng Zhao, Lawson L. S. Wong

    Abstract: With the development of robot electronic skin technology, various tactile sensors, enhanced by AI, are unlocking a new dimension of perception for robots. In this work, we explore how robots equipped with electronic skin can recognize tactile gestures and interpret them as human commands. We developed a modular robot E-skin, composed of multiple irregularly shaped skin patches, which can be assemb…

    Submitted 22 June, 2025; originally announced June 2025.

  13. arXiv:2506.17202  [pdf, ps, other]

    cs.CV

    UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

    Authors: Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu, Yu Qiao, Jun Zhang, Wenqi Shao

    Abstract: Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Code: https://github.com/tliby/UniFork

  14. arXiv:2506.15691  [pdf, other]

    cs.LG cs.AI

    What Do Latent Action Models Actually Learn?

    Authors: Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, Jiang Bian

    Abstract: Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing changes between frames as latents. However, differences between video frames can be caused by controllable changes as well as exogenous noise, leading to an important concern -- do latents capture the changes caused by actions or irrelevant noise? This paper studies this issue analytically, presen…

    Submitted 26 May, 2025; originally announced June 2025.

  15. arXiv:2506.13405  [pdf, ps, other]

    cs.CL

    RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis

    Authors: Pengzuo Wu, Yuhang Yang, Guangcheng Zhu, Chao Ye, Hong Gu, Xu Lu, Ruixuan Xiao, Bowen Bao, Yijing He, Liangyu Zha, Wentao Ye, Junbo Zhao, Haobo Wang

    Abstract: With the rapid advancement of Large Language Models (LLMs), there is an increasing need for challenging benchmarks to evaluate their capabilities in handling complex tabular data. However, existing benchmarks are either based on outdated data setups or focus solely on simple, flat table structures. In this paper, we introduce RealHiTBench, a comprehensive benchmark designed to evaluate the perform…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: ACL 2025

  16. arXiv:2506.12544  [pdf, ps, other]

    eess.SY cs.RO

    Constrained Diffusers for Safe Planning and Control

    Authors: Jichen Zhang, Liqun Zhao, Antonis Papachristodoulou, Jack Umenberger

    Abstract: Diffusion models have shown remarkable potential in planning and control tasks due to their ability to represent multimodal distributions over actions and trajectories. However, ensuring safety under constraints remains a critical challenge for diffusion models. This paper proposes Constrained Diffusers, a novel framework that incorporates constraints into pre-trained diffusion models without retr…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: 12 pages, 5 figures

  17. arXiv:2506.12480  [pdf, ps, other]

    cs.LG eess.SP

    Quantizing Small-Scale State-Space Models for Edge AI

    Authors: Leo Zhao, Tristan Torchet, Melika Payvand, Laura Kriener, Filippo Moro

    Abstract: State-space models (SSMs) have recently gained attention in deep learning for their ability to efficiently model long-range dependencies, making them promising candidates for edge-AI applications. In this paper, we analyze the effects of quantization on small-scale SSMs with a focus on reducing memory and computational costs while maintaining task performance. Using the S4D architecture, we first…

    Submitted 14 June, 2025; originally announced June 2025.

  18. arXiv:2506.11167  [pdf, ps, other]

    cs.CV cs.LG

    Towards a general-purpose foundation model for fMRI analysis

    Authors: Cheng Wang, Yu Jiang, Zhihao Peng, Chenxin Li, Changbae Bang, Lin Zhao, Jinglei Lv, Jorge Sepulcre, Carl Yang, Lifang He, Tianming Liu, Daniel Barron, Quanzheng Li, Randy Hirschtick, Byung-Hoon Kim, Xiang Li, Yixuan Yuan

    Abstract: Functional Magnetic Resonance Imaging (fMRI) is essential for studying brain function and diagnosing neurological disorders, but current analysis methods face reproducibility and transferability issues due to complex pre-processing and task-specific models. We introduce NeuroSTORM (Neuroimaging Foundation Model with Spatial-Temporal Optimized Representation Modeling), a generalizable framework tha…

    Submitted 11 June, 2025; originally announced June 2025.

  19. arXiv:2506.11017  [pdf, ps, other]

    cs.CL cs.AI cs.PF

    TeleEval-OS: Performance evaluations of large language models for operations scheduling

    Authors: Yanyan Wang, Yingying Wang, Junli Liang, Yin Xu, Yunlong Liu, Yiming Xu, Zhengwang Jiang, Zhehe Li, Fei Li, Long Zhao, Kuang Xu, Qi Song, Xiangyang Li

    Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in artificial intelligence, demonstrating substantial application potential across multiple specialized domains. Telecommunications operation scheduling (OS) is a critical aspect of the telecommunications industry, involving the coordinated management of networks, services, risks, and human resources to opti…

    Submitted 5 May, 2025; originally announced June 2025.

  20. arXiv:2506.10910  [pdf, ps, other]

    cs.CL

    Magistral

    Authors: Mistral-AI: Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Léonard Blier, Lucile Saulnier, Matthieu Dinot, Maxime Darrin, Neha Gupta, Roman Soletskyi, Sagar Vaze, Teven Le Scao, Yihan Wang, Adam Yang, Alexander H. Liu, Alexandre Sablayrolles, Amélie Héliou, et al. (76 additional authors not shown)

    Abstract: We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a s…

    Submitted 12 June, 2025; originally announced June 2025.

  21. arXiv:2506.03569  [pdf, ps, other]

    cs.CL

    MiMo-VL Technical Report

    Authors: Xiaomi LLM-Core Team: Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, et al. (50 additional authors not shown)

    Abstract: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 32 pages

  22. arXiv:2506.03077  [pdf, ps, other]

    cs.LG cs.AI

    StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

    Authors: Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li

    Abstract: Training language models on long sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost for storing activation values becomes huge during the Backpropagation (BP) process, even with the application of gradient checkpointing technique. To tackle this challenge, we propose…

    Submitted 3 June, 2025; originally announced June 2025.

  23. arXiv:2506.02334  [pdf, ps, other]

    cs.CV

    Generalized Category Discovery via Reciprocal Learning and Class-Wise Distribution Regularization

    Authors: Duo Liu, Zhiquan Tan, Linglan Zhao, Zhongqiang Zhang, Xiangzhong Fang, Weiran Huang

    Abstract: Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to unreliable…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: ICML2025 Poster

  24. arXiv:2506.01600  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    WoMAP: World Models For Embodied Open-Vocabulary Object Localization

    Authors: Tenny Yin, Zhiting Mei, Tao Sun, Lihan Zha, Emily Zhou, Jeremy Bao, Miyu Yamane, Ola Shorinwa, Anirudha Majumdar

    Abstract: Language-instructed active object localization is a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (Wor…

    Submitted 2 June, 2025; originally announced June 2025.

  25. arXiv:2506.01396  [pdf, ps, other]

    cs.LG cs.CR stat.ML

    Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping

    Authors: Linzh Zhao, Aki Rehn, Mikko A. Heikkilä, Razane Tajeddine, Antti Honkela

    Abstract: Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: NeurIPS 2025 under review. 22 pages, 8 figures

    ACM Class: I.2.6; K.4.2

  26. arXiv:2506.01111  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

    Authors: Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang

    Abstract: High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs soph…

    Submitted 1 June, 2025; originally announced June 2025.

  27. arXiv:2505.24063  [pdf]

    cs.CL cs.DB

    TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

    Authors: Jiacheng Xie, Yang Yu, Ziyang Zhang, Shuai Zeng, Jiaxuan He, Ayush Vasireddy, Xiaoting Tang, Congyu Guo, Lening Zhao, Congcong Jing, Guanghui An, Dong Xu

    Abstract: Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has underscored the need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and prim…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: 22 pages, 4 figures

  28. arXiv:2505.23195  [pdf, other]

    cs.LG cs.AI

    Less is More: Unlocking Specialization of Time Series Foundation Models via Structured Pruning

    Authors: Lifan Zhao, Yanyan Shen, Zhaoyang Liu, Xue Wang, Jiaji Deng

    Abstract: Scaling laws motivate the development of Time Series Foundation Models (TSFMs) that pre-train vast parameters and achieve remarkable zero-shot forecasting performance. Surprisingly, even after fine-tuning, TSFMs cannot consistently outperform smaller, specialized models trained on full-shot downstream data. A key question is how to realize effective adaptation of TSFMs for a target forecasting tas…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Manuscript with fixed typos and figures

  29. arXiv:2505.21956  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation

    Authors: Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao

    Abstract: Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG,…

    Submitted 28 May, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

  30. arXiv:2505.21890  [pdf, ps, other]

    cs.CV

    Hyperspectral Gaussian Splatting

    Authors: Sunil Kumar Narayanan, Lingjun Zhao, Lu Gan, Yongsheng Chen

    Abstract: Hyperspectral imaging (HSI) has been widely used in agricultural applications for non-destructive estimation of plant nutrient composition and precise determination of nutritional elements in samples. Recently, 3D reconstruction methods have been used to create implicit neural representations of HSI scenes, which can help localize the target object's nutrient composition spatially and spectrally.…

    Submitted 27 May, 2025; originally announced May 2025.

  31. arXiv:2505.21845  [pdf, ps, other]

    stat.ML cs.LG cs.SI stat.ME

    Spectral clustering for dependent community Hawkes process models of temporal networks

    Authors: Lingfei Zhao, Hadeel Soliman, Kevin S. Xu, Subhadeep Paul

    Abstract: Temporal networks observed continuously over time through timestamped relational events data are commonly encountered in application settings including online social media communications, financial transactions, and international relations. Temporal networks often exhibit community structure and strong dependence patterns among node pairs. This dependence can be modeled through mutual excitations,…

    Submitted 27 May, 2025; originally announced May 2025.

  32. arXiv:2505.21418  [pdf, ps, other]

    cs.MA

    Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery

    Authors: Lina Zhao, Jiaxing Bai, Zihao Bian, Qingyue Chen, Yafang Li, Guangbo Li, Min He, Huaiyuan Yao, Zongjiu Zhang

    Abstract: Focused Ultrasound Ablation Surgery (FUAS) has emerged as a promising non-invasive therapeutic modality, valued for its safety and precision. Nevertheless, its clinical implementation entails intricate tasks such as multimodal image interpretation, personalized dose planning, and real-time intraoperative decision-making processes that demand intelligent assistance to improve efficiency and reliabi…

    Submitted 27 May, 2025; originally announced May 2025.

  33. arXiv:2505.21333  [pdf, other]

    cs.CV

    MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

    Authors: Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, Zhuoran Zhang, Xinlong Chen, Bohan Zeng, Sihan Yang, Yuanxing Zhang, Pengfei Wan, Haotian Wang, Wenjing Yang

    Abstract: Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmar…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: preprint

  34. arXiv:2505.21325  [pdf, ps, other]

    cs.CV

    MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on

    Authors: Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang

    Abstract: Video Virtual Try-On (VVT) aims to simulate the natural appearance of garments across consecutive video frames, capturing their dynamic variations and interactions with human body motion. However, current VVT methods still face challenges in terms of spatiotemporal consistency and garment content preservation. First, they use diffusion models based on the U-Net, which are limited in their expressi…

    Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

  35. arXiv:2505.21140  [pdf, ps, other]

    cs.LG cs.AI

    HeteroBA: A Structure-Manipulating Backdoor Attack on Heterogeneous Graphs

    Authors: Honglin Gao, Xiang Li, Lan Zhao, Gaoxi Xiao

    Abstract: Heterogeneous graph neural networks (HGNNs) have recently drawn increasing attention for modeling complex multi-relational data in domains such as recommendation, finance, and social networks. While existing research has been largely focused on enhancing HGNNs' predictive performance, their robustness and security, especially under backdoor attacks, remain underexplored. In this paper, we propose…

    Submitted 27 May, 2025; originally announced May 2025.

  36. arXiv:2505.20643  [pdf, ps, other]

    cs.LG cs.AI

    Can Past Experience Accelerate LLM Reasoning?

    Authors: Bo Pan, Liang Zhao

    Abstract: Allocating more compute to large language models (LLMs) reasoning has generally been demonstrated to improve their effectiveness, but also results in increased inference time. In contrast, humans can perform tasks faster and better with increased experience and exposure. Hence, this paper aims to investigate the question: Can LLMs also become faster at reasoning through recurrent exposure on relev…

    Submitted 26 May, 2025; originally announced May 2025.

  37. arXiv:2505.19299  [pdf, ps, other]

    cs.CL cs.AI

    A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations

    Authors: Lingjun Zhao, Hal Daumé III

    Abstract: Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging to generate by language models and assess by humans. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a…

    Submitted 25 May, 2025; originally announced May 2025.

  38. arXiv:2505.18399  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Taming Diffusion for Dataset Distillation with High Representativeness

    Authors: Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, Xue Lin

    Abstract: Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion, it has been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset disti…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: The paper is accepted by ICML 2025

  39. arXiv:2505.16463  [pdf, ps, other]

    cs.CV cs.LG

    AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer

    Authors: Jiquan Shan, Junxiao Wang, Lifeng Zhao, Liang Cai, Hongyuan Zhang, Ioannis Liritzis

    Abstract: Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given $n$ patches, they will have quadratic complexity such as $\mathcal{O}(n^2)$ and the time cost is high when splitting the input image with a small granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of…

    Submitted 18 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  40. arXiv:2505.16249  [pdf, ps, other]

    cs.RO cs.AI

    Manipulating Elasto-Plastic Objects With 3D Occupancy and Learning-Based Predictive Control

    Authors: Zhen Zhang, Xiangyu Chu, Yunxi Tang, Lulu Zhao, Jing Huang, Zhongliang Jiang, K. W. Samuel Au

    Abstract: Manipulating elasto-plastic objects remains a significant challenge due to severe self-occlusion, difficulties of representation, and complicated dynamics. This work proposes a novel framework for elasto-plastic object manipulation with a quasi-static assumption for motions, leveraging 3D occupancy to represent such objects, a learned dynamics model trained with 3D occupancy, and a learning-based…

    Submitted 22 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: 8 Pages, 13 figures, accepted for publication in IEEE Robotics and Automation Letters (RA-L)

  41. arXiv:2505.15962  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Pre-training Large Memory Language Models with Internal and External Knowledge

    Authors: Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun

    Abstract: Neural language models are black-boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weight…

    Submitted 2 July, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

    Comments: Code, models, and data available at https://github.com/kilian-group/LMLM

  42. arXiv:2505.15818  [pdf, ps, other]

    cs.CV

    InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

    Authors: Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, Xue Yang

    Abstract: Language-Guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Orie…

    Submitted 21 May, 2025; originally announced May 2025.

  43. arXiv:2505.15098  [pdf, ps, other]

    cs.RO cs.AI

    Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation

    Authors: Yihang Li, Tianle Zhang, Xuelong Wei, Jiayi Li, Lin Zhao, Dongchi Huang, Zhirui Fang, Minhua Zheng, Wenjun Dai, Xiaodong He

    Abstract: Robot manipulation learning from human demonstrations offers a rapid means to acquire skills but often lacks generalization across diverse scenes and object placements. This limitation hinders real-world applications, particularly in complex tasks requiring dexterous manipulation. Vision-Language-Action (VLA) paradigm leverages large-scale data to enhance generalization. However, due to data scarc…

    Submitted 21 May, 2025; originally announced May 2025.

  44. arXiv:2505.14149  [pdf]

    cs.CL cs.DL cs.IR

    Enhancing Keyphrase Extraction from Academic Articles Using Section Structure Information

    Authors: Chengzhi Zhang, Xinyi Yan, Lei Zhao, Yingyi Zhang

    Abstract: The exponential increase in academic papers has significantly increased the time required for researchers to access relevant literature. Keyphrase Extraction (KPE) offers a solution to this situation by enabling researchers to efficiently retrieve relevant literature. The current study on KPE from academic articles aims to improve the performance of extraction models through innovative approaches…

    Submitted 20 May, 2025; originally announced May 2025.

    Journal ref: Scientometrics, 2025

  45. arXiv:2505.13911  [pdf]

    eess.IV cs.AI cs.CV

    Bronchovascular Tree-Guided Weakly Supervised Learning Method for Pulmonary Segment Segmentation

    Authors: Ruijie Zhao, Zuopeng Tan, Xiao Xue, Longfei Zhao, Bing Li, Zicheng Liao, Ying Ming, Jiaru Wang, Ran Xiao, Sirong Piao, Rui Zhao, Qiqi Xu, Wei Song

    Abstract: Pulmonary segment segmentation is crucial for cancer localization and surgical planning. However, the pixel-wise annotation of pulmonary segments is laborious, as the boundaries between segments are indistinguishable in medical images. To this end, we propose a weakly supervised learning (WSL) method, termed Anatomy-Hierarchy Supervised Learning (AHSL), which consults the precise clinical anatomic…

    Submitted 20 May, 2025; originally announced May 2025.

  46. arXiv:2505.13519  [pdf, other]

    stat.ML cs.AI cs.LG

    Continuous Domain Generalization

    Authors: Zekun Cai, Yiheng Yao, Guangji Bai, Renhe Jiang, Xuan Song, Ryosuke Shibasaki, Liang Zhao

    Abstract: Real-world data distributions often shift continuously across multiple latent factors such as time, geography, and socioeconomic context. However, existing domain generalization approaches typically treat domains as discrete or evolving along a single axis (e.g., time), which fails to capture the complex, multi-dimensional nature of real-world variation. This paper introduces the task of Continuou…

    Submitted 17 May, 2025; originally announced May 2025.

    Comments: 22 pages, 9 figures

  47. arXiv:2505.12276  [pdf, ps, other]

    cs.SI

    Community detection of hypergraphs by Ricci flow

    Authors: Yulu Tian, Jicheng Ma, Yunyan Yang, Liang Zhao

    Abstract: Community detection in hypergraphs is both instrumental for functional module identification and intricate due to higher-order interactions among nodes. We define a hypergraph Ricci flow that directly operates on higher-order interactions of hypergraphs and prove long-time existence of the flow. Building on this theoretical foundation, we develop HyperRCD, a Ricci-flow-based community detection app…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: 19 pages, 6 figures

  48. arXiv:2505.10685  [pdf, other]

    cs.CV

    GaussianFormer3D: Multi-Modal Gaussian-based Semantic Occupancy Prediction with 3D Deformable Attention

    Authors: Lingjun Zhao, Sizhe Wei, James Hays, Lu Gan

    Abstract: 3D semantic occupancy prediction is critical for achieving safe and reliable autonomous driving. Compared to camera-only perception systems, multi-modal pipelines, especially LiDAR-camera fusion methods, can produce more accurate and detailed predictions. Although most existing works utilize a dense grid-based representation, in which the entire 3D space is uniformly divided into discrete voxels,…

    Submitted 15 May, 2025; originally announced May 2025.

  49. arXiv:2505.09847  [pdf, other]

    cs.LG cs.AI cs.IR stat.ML

    Causal Predictive Optimization and Generation for Business AI

    Authors: Liyang Zhao, Olurotimi Seton, Himadeep Reddy Reddivari, Suvendu Jena, Shadow Zhao, Rachit Kumar, Changshuai Wei

    Abstract: The sales process involves sales functions converting leads or opportunities to customers and selling more products to existing customers. The optimization of the sales process thus is key to success of any B2B business. In this work, we introduce a principled approach to sales optimization and business AI, namely the Causal Predictive Optimization and Generation, which includes three layers: 1) p…

    Submitted 21 May, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

  50. arXiv:2505.09764  [pdf, ps, other]

    cs.DC cs.NI

    FLASH: Fast All-to-All Communication in GPU Clusters

    Authors: Yiran Lei, Dongjoo Lee, Liangyu Zhao, Daniar Kurniawan, Chanmyeong Kim, Heetaek Jeong, Changsu Kim, Hyeonseong Choi, Liangcheng Yu, Arvind Krishnamurthy, Justine Sherry, Eriko Nurvitadhi

    Abstract: Scheduling All-to-All communications efficiently is fundamental to minimizing job completion times in distributed systems. Incast and straggler flows can slow down All-to-All transfers; and GPU clusters bring additional straggler challenges due to highly heterogeneous link capacities between technologies like NVLink and Ethernet. Existing schedulers all suffer high overheads relative to theoretica…

    Submitted 14 May, 2025; originally announced May 2025.