
Showing 1–50 of 473 results for author: Bai, X

Searching in archive cs.
  1. arXiv:2505.08725

    cs.CV

    Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving

    Authors: Zongchuang Zhao, Haoyu Fu, Dingkang Liang, Xin Zhou, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai

    Abstract: Large Visual-Language Models (LVLMs) have significantly advanced image understanding. Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios. However, existing research typically focuses on front-view perspectives and partial objects within scenes, struggling to achieve comprehensive scene understanding. Meanwhile, existing LVLMs suffer fro…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: The dataset and code will be released at https://github.com/zc-zhao/DriveMonkey

  2. EAGLE: Contrastive Learning for Efficient Graph Anomaly Detection

    Authors: Jing Ren, Mingliang Hou, Zhixuan Liu, Xiaomei Bai

    Abstract: Graph anomaly detection is a popular and vital task in various real-world scenarios, which has been studied for several decades. Recently, many studies extending deep learning-based methods have shown preferable performance on graph anomaly detection. However, existing methods lack the efficiency that is necessary for embedded devices. To this end, we propose an Efficient Anomal…

    Submitted 12 May, 2025; originally announced May 2025.

  3. arXiv:2505.04380

    eess.IV cs.CV cs.IR

    Tetrahedron-Net for Medical Image Registration

    Authors: Jinhai Xiang, Shuai Guo, Qianru Han, Dantong Shi, Xinwei He, Xiang Bai

    Abstract: Medical image registration plays a vital role in medical image processing. Extracting expressive representations for medical images is crucial for improving the registration quality. One common practice for this end is constructing a convolutional backbone to enable interactions with skip connections among feature extraction layers. The de facto structure, U-Net-like networks, has attempted to des…

    Submitted 7 May, 2025; originally announced May 2025.

  4. arXiv:2505.02056

    cs.CV cs.LG

    Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin

    Authors: Yuchen Wang, Xuefeng Bai, Xiucheng Li, Weili Guan, Liqiang Nie, Xinyang Chen

    Abstract: Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve int…

    Submitted 4 May, 2025; originally announced May 2025.

    Comments: Accepted to ICML 2025

  5. arXiv:2504.21682

    cs.CV

    Visual Text Processing: A Comprehensive Review and Unified Evaluation

    Authors: Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, Xu-Cheng Yin, Nicu Sebe

    Abstract: Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manip…

    Submitted 30 April, 2025; originally announced April 2025.

  6. arXiv:2504.15376

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Towards Understanding Camera Motions in Any Video

    Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan

    Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that s…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: Project site: https://linzhiqiu.github.io/papers/camerabench/

  7. arXiv:2504.12395

    cs.CV

    InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

    Authors: Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, Qinglin Lu

    Abstract: Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character cus…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: Tech Report. Code is available at https://github.com/Tencent/InstantCharacter

  8. arXiv:2504.09966

    cs.CV

    SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting

    Authors: Dongliang Luo, Hanshen Zhu, Ziyang Zhang, Dingkang Liang, Xudong Xie, Yuliang Liu, Xiang Bai

    Abstract: Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between de…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025. Code will be available at https://github.com/DrLuo/SemiETS

  9. arXiv:2504.05537

    cs.CV cs.AI

    Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling

    Authors: Tasmiah Haque, Md. Asif Bin Syed, Byungheon Jeong, Xue Bai, Sumit Mohan, Somdyuti Paul, Imtiaz Ahmed, Srinjoy Das

    Abstract: We propose a deep learning framework designed to significantly optimize bandwidth for motion-transfer-enabled video applications, including video conferencing, virtual reality interactions, health monitoring systems, and vision-based real-time anomaly detection. To capture complex motion effectively, we utilize the First Order Motion Model (FOMM), which encodes dynamic objects by detecting keypoin…

    Submitted 7 April, 2025; originally announced April 2025.

  10. arXiv:2504.02328

    cs.CV

    Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

    Authors: Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang

    Abstract: Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal task…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: ICLR 2025

  11. arXiv:2504.01509

    cs.CL

    PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

    Authors: Zhengwei Tao, Zhi Jin, Bincheng Li, Xiaoying Bai, Haiyan Zhao, Chengfeng Dou, Xiancai Chen, Jia Li, Linyu Li, Chongyang Tao

    Abstract: Predicting future events stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, thereby garnering significant interest in the research community. Currently, several benchmarks have been established to evaluate the forecasting capabilities by formalizing the event…

    Submitted 2 April, 2025; originally announced April 2025.

  12. arXiv:2503.21771

    cs.CV

    A Unified Image-Dense Annotation Generation Model for Underwater Scenes

    Authors: Hongkai Lin, Dingkang Liang, Zhenghao Qi, Xiang Bai

    Abstract: Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025. The code is available at https://github.com/HongkLin/TIDE

  13. arXiv:2503.19755

    cs.CV

    ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

    Authors: Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai

    Abstract: End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem is still open that few VLMs for E2E methods perform well in the clo…

    Submitted 25 March, 2025; originally announced March 2025.

  14. arXiv:2503.16854

    cs.CV

    Generative Compositor for Few-Shot Visual Information Extraction

    Authors: Zhibo Yang, Wei Hua, Sibo Song, Cong Yao, Yingying Zhu, Wenqing Cheng, Xiang Bai

    Abstract: Visual Information Extraction (VIE), aiming at extracting structured information from visually rich document images, plays a pivotal role in document processing. Considering various layouts, semantic scopes, and languages, VIE encompasses an extensive range of types, potentially numbering in the thousands. However, many of these types suffer from a lack of training data, which poses significant ch…

    Submitted 21 March, 2025; originally announced March 2025.

  15. arXiv:2503.13882

    cs.LG cs.AI

    MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments

    Authors: Zhengsheng Guo, Linwei Zheng, Xinyang Chen, Xuefeng Bai, Kehai Chen, Min Zhang

    Abstract: While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture…

    Submitted 18 March, 2025; originally announced March 2025.

  16. arXiv:2503.13587

    cs.CV

    Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception

    Authors: Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai

    Abstract: We present UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework. Unlike existing models focusing solely on pixel-level future prediction or geometric reasoning, our approach jointly models future appearance (i.e., RGB image) and geometry (i.e., depth), ensuring coherent predictions. Specifically, during th…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: The project page is at https://github.com/dk-liang/UniFuture

  17. arXiv:2503.13401

    cs.CL cs.AI

    Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis

    Authors: Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths

    Abstract: Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods b…

    Submitted 17 March, 2025; originally announced March 2025.

  18. arXiv:2503.11647

    cs.CV

    ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

    Authors: Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang

    Abstract: Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-contr…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Project page: https://jianhongbai.github.io/ReCamMaster/

  19. arXiv:2503.10211

    cs.CL cs.SD eess.AS

    Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation

    Authors: Henglyu Liu, Andong Chen, Kehai Chen, Xuefeng Bai, Meizhi Zhong, Yuan Qiu, Min Zhang

    Abstract: Recent advancement of large language models (LLMs) has led to significant breakthroughs across various tasks, laying the foundation for the development of LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 12 pages, 7 figures

  20. arXiv:2503.10093

    cs.CL

    Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model

    Authors: Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang

    Abstract: Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, whi…

    Submitted 13 March, 2025; originally announced March 2025.

  21. arXiv:2503.08328

    cs.LG cs.IR

    MFRS: A Multi-Frequency Reference Series Approach to Scalable and Accurate Time-Series Forecasting

    Authors: Liang Yu, Lai Tu, Xiang Bai

    Abstract: Multivariate time-series forecasting holds immense value across diverse applications, requiring methods to effectively capture complex temporal and inter-variable dynamics. A key challenge lies in uncovering the intrinsic patterns that govern predictability, beyond conventional designs, focusing on network architectures to explore latent relationships or temporal dependencies. Inspired by signal d…

    Submitted 11 March, 2025; originally announced March 2025.

  22. arXiv:2503.08101

    cs.CV

    Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning

    Authors: Lizhen Xu, Xiuxiu Bai, Xiaojun Jia, Jianwu Fang, Shanmin Pang

    Abstract: Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient running on edge devices. Existing pruning and distillation methods either need retraining or are designed for ViT models, which ar…

    Submitted 12 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

    Comments: The code can be found at https://github.com/iseri27/tg_gbc

  23. arXiv:2503.07539

    cs.CL

    XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

    Authors: Zhenyu Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yaoyin Zhang, Xuchen Wei, Juntao Li, Min Zhang

    Abstract: Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings remains poorly understood, as existing evaluations lack fine-grained constraint analysis. We introduce XIFBench, a comprehensive constraint-based benchmark for assessing multilingual instruction-following abilities of LLMs, fe…

    Submitted 10 March, 2025; originally announced March 2025.

  24. arXiv:2503.07170

    cs.CL cs.AI

    DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation

    Authors: Ming Wang, Fang Wang, Minghao Hu, Li He, Haiyang Wang, Jun Zhang, Tianwei Yan, Li Li, Zhunchen Luo, Wei Luo, Xiaoying Bai, Guotong Geng

    Abstract: Long-form article generation (LFAG) presents challenges such as maintaining logical consistency, comprehensive topic coverage, and narrative coherence across extended articles. Existing datasets often lack both the hierarchical structure and fine-grained annotation needed to effectively decompose tasks, resulting in shallow, disorganized article generation. To address these limitations, we introdu…

    Submitted 10 March, 2025; originally announced March 2025.

  25. arXiv:2503.02911

    cs.SE cs.AI cs.CL

    Text2Scenario: Text-Driven Scenario Generation for Autonomous Driving Test

    Authors: Xuan Cai, Xuesong Bai, Zhiyong Cui, Danmu Xie, Daocheng Fu, Haiyang Yu, Yilong Ren

    Abstract: Autonomous driving (AD) testing constitutes a critical methodology for assessing performance benchmarks prior to product deployment. The creation of segmented scenarios within a simulated environment is acknowledged as a robust and effective strategy; however, the process of tailoring these scenarios often necessitates laborious and time-consuming manual efforts, thereby hindering the development…

    Submitted 4 March, 2025; originally announced March 2025.

  26. arXiv:2503.02519

    cs.CL

    Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent

    Authors: Xingzuo Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yong Xu, Min Zhang

    Abstract: Large language model (LLM) agents typically adopt a step-by-step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep-rooted one-pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address…

    Submitted 10 March, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

  27. arXiv:2503.02034

    cs.CV cs.AI

    Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA

    Authors: Zhusi Zhong, Yuli Wang, Lulu Bi, Zhuoqi Ma, Sun Ho Ahn, Christopher J. Mullin, Colin F. Greineder, Michael K. Atalay, Scott Collins, Grayson L. Baird, Cheng Ting Lin, Webster Stayman, Todd M. Kolb, Ihab Kamel, Harrison X. Bai, Zhicheng Jiao

    Abstract: Medical imaging plays a pivotal role in modern healthcare, with computed tomography pulmonary angiography (CTPA) being a critical tool for diagnosing pulmonary embolism and other thoracic conditions. However, the complexity of interpreting CTPA scans and generating accurate radiology reports remains a significant challenge. This paper introduces Abn-BLIP (Abnormality-aligned Bootstrapping Language…

    Submitted 3 March, 2025; originally announced March 2025.

  28. arXiv:2502.20869

    cs.CV

    PathVG: A New Benchmark and Dataset for Pathology Visual Grounding

    Authors: Chunlin Zhong, Shuang Hao, Junhua Wu, Xiaona Chang, Jiwei Jiang, Xiu Nie, He Tang, Xiang Bai

    Abstract: With the rapid development of computational pathology, many AI-assisted diagnostic tasks have emerged. Cellular nuclei segmentation can segment various types of cells for downstream analysis, but it relies on predefined categories and lacks flexibility. Moreover, pathology visual question answering can perform image-level understanding but lacks region-level detection capability. To address this,…

    Submitted 28 February, 2025; originally announced February 2025.

    Comments: 10 pages, 4 figures

  29. arXiv:2502.20859

    cs.CL

    The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents

    Authors: Yifan Duan, Yihong Tang, Xuefeng Bai, Kehai Chen, Juntao Li, Min Zhang

    Abstract: Large language models (LLMs) excel in both closed tasks (including problem-solving, and code generation) and open tasks (including creative writing), yet existing explanations for their capabilities lack connections to real-world human intelligence. To fill this gap, this paper systematically investigates LLM intelligence through the lens of "human simulation", addressing three core questions: (…

    Submitted 28 February, 2025; originally announced February 2025.

  30. arXiv:2502.20757

    cs.CL

    The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents

    Authors: Yihong Tang, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Bo Wang, Jie Liu, Min Zhang

    Abstract: Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. To address this issue, we first conduct a syst…

    Submitted 28 February, 2025; originally announced February 2025.

  31. arXiv:2502.20387

    cs.CV

    InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

    Authors: Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Jun Zhou, Lin Gu

    Abstract: Despite exhibiting impressive performance in synthesizing lifelike personalized 3D talking heads, prevailing methods based on radiance fields suffer from high demands for training data and time for each new identity. This paper introduces InsTaG, a 3D talking head synthesis framework that allows a fast learning of realistic personalized 3D talking head from few training data. Built upon a lightwei…

    Submitted 27 February, 2025; originally announced February 2025.

    Comments: Accepted at CVPR 2025. Project page: https://fictionarry.github.io/InsTaG/

  32. arXiv:2502.17863

    cs.CV cs.AI

    A Survey: Spatiotemporal Consistency in Video Generation

    Authors: Zhiyu Yin, Kehai Chen, Xuefeng Bai, Ruili Jiang, Juntao Li, Hongdong Li, Jin Liu, Yang Xiang, Jun Yu, Min Zhang

    Abstract: Video generation, by leveraging a dynamic visual generation method, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Video generation presents unique challenges beyond static image generation, requiring both high-quality individual frames and temporal coherence to maintain consistency across the spatiotemporal sequence. Recent works have aimed at addressing the spatiotemp…

    Submitted 25 February, 2025; originally announced February 2025.

  33. arXiv:2502.16161

    cs.CV cs.CL

    OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models

    Authors: Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, Xiang Bai

    Abstract: Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individu…

    Submitted 22 February, 2025; originally announced February 2025.

  34. arXiv:2502.11806

    cs.CL

    Exploring Translation Mechanism of Large Language Models

    Authors: Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Yang Xiang, Min Zhang

    Abstract: Large language models (LLMs) have succeeded remarkably in multilingual translation tasks. However, the inherent translation mechanisms of LLMs remain poorly understood, largely due to sophisticated architectures and vast parameter scales. In response to this issue, this study explores the translation mechanism of LLM from the perspective of computational components (e.g., attention heads and MLPs)…

    Submitted 25 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

  35. arXiv:2502.10999

    cs.CV cs.AI cs.CL cs.MM

    ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations

    Authors: Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor

    Abstract: This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. T…

    Submitted 16 February, 2025; originally announced February 2025.

    Comments: This is preliminary work and code will be released at github.com/bowen-upenn/ControlText

  36. arXiv:2502.10498

    cs.CV

    The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey

    Authors: Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, Xiang Bai

    Abstract: Driving World Model (DWM), which focuses on predicting scene evolution during the driving process, has emerged as a promising paradigm in pursuing autonomous driving. These methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments. In this survey, we provide a comprehensive overview of the latest progress in DWM. We categorize existing…

    Submitted 14 February, 2025; originally announced February 2025.

    Comments: For continuous updates, please follow the repository: https://github.com/LMD0311/Awesome-World-Model

  37. arXiv:2502.09352

    cs.LG cs.CV math.OC

    Wasserstein distributional adversarial training for deep neural networks

    Authors: Xingjian Bai, Guangyi He, Yifan Jiang, Jan Obloj

    Abstract: Design of adversarial attacks for deep neural networks, as well as methods of adversarial training against them, are subject of intense research. In this paper, we propose methods to train against distributional attack threats, extending the TRADES method used for pointwise attacks. Our approach leverages recent contributions and relies on sensitivity analysis for Wasserstein distributionally robu…

    Submitted 13 February, 2025; originally announced February 2025.

    Comments: 15 pages, 4 figures

  38. arXiv:2501.15588

    eess.IV cs.CV

    Tumor Detection, Segmentation and Classification Challenge on Automated 3D Breast Ultrasound: The TDSC-ABUS Challenge

    Authors: Gongning Luo, Mingwang Xu, Hongyu Chen, Xinjie Liang, Xing Tao, Dong Ni, Hyunsu Jeong, Chulhong Kim, Raphael Stock, Michael Baumgartner, Yannick Kirchhoff, Maximilian Rokuss, Klaus Maier-Hein, Zhikai Yang, Tianyu Fan, Nicolas Boutry, Dmitry Tereshchenko, Arthur Moine, Maximilien Charmetant, Jan Sauer, Hao Du, Xiang-Hui Bai, Vipul Pai Raikar, Ricardo Montoya-del-Angel, Robert Marti, et al. (12 additional authors not shown)

    Abstract: Breast cancer is one of the most common causes of death among women worldwide. Early detection helps in reducing the number of deaths. Automated 3D Breast Ultrasound (ABUS) is a newer approach for breast screening, which has many advantages over handheld mammography such as safety, speed, and higher detection rate of breast cancer. Tumor detection, segmentation, and classification are key componen…

    Submitted 26 January, 2025; originally announced January 2025.

  39. arXiv:2501.15394

    cs.CV

    Doracamom: Joint 3D Detection and Occupancy Prediction with Multi-view 4D Radars and Cameras for Omnidirectional Perception

    Authors: Lianqing Zheng, Jianan Liu, Runwei Guan, Long Yang, Shouyi Lu, Yuanzhe Li, Xiaokai Bai, Jie Bai, Zhixiong Ma, Hui-Liang Shen, Xichan Zhu

    Abstract: 3D object detection and occupancy prediction are critical tasks in autonomous driving, attracting significant attention. Despite the potential of recent vision-based methods, they encounter challenges under adverse conditions. Thus, integrating cameras with next-generation 4D imaging radar to achieve unified multi-task perception is highly significant, though research in this domain remains limite…

    Submitted 3 March, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

  40. arXiv:2501.14729

    cs.CV

    HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

    Authors: Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, Xiang Bai

    Abstract: Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding…

    Submitted 12 March, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: The code will be available at https://github.com/LMD0311/HERMES

  41. arXiv:2501.11592

    cs.LG cs.AI cs.CL

    Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

    Authors: Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai

    Abstract: Pre-trained large models attract widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-proved theory that drives many recent breakthroughs in these applications. However, as a typical under-determined linear system,…

    Submitted 23 January, 2025; v1 submitted 20 January, 2025; originally announced January 2025.

  42. arXiv:2501.05460  [pdf, other]

    cs.DC cs.AI cs.CV cs.LG

    Efficiently Serving Large Multimodal Models Using EPD Disaggregation

    Authors: Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan

    Abstract: Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively impacts key Service Level Objectives (SLOs) like time to first token (TTFT) and end-to-end throughput (E2ETP). We introduce Encode-Prefill-D…

    Submitted 5 February, 2025; v1 submitted 25 December, 2024; originally announced January 2025.

    Comments: 16 pages, 11 figures

  43. arXiv:2501.01427  [pdf, other]

    cs.CV

    VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

    Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao

    Abstract: Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motio…

    Submitted 7 January, 2025; v1 submitted 2 January, 2025; originally announced January 2025.

    Comments: Project page: https://videoanydoor.github.io/

  44. arXiv:2501.00321  [pdf, other]

    cs.CV cs.AI

    OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

    Authors: Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxin Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, Xiang Bai

    Abstract: Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge th…

    Submitted 31 December, 2024; originally announced January 2025.

  45. arXiv:2412.19412  [pdf, other]

    cs.CV

    MINIMA: Modality Invariant Image Matching

    Authors: Jiangwei Ren, Xingyu Jiang, Zizhuo Li, Dingkang Liang, Xin Zhou, Xiang Bai

    Abstract: Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. In this paper, we present MINIMA, a unified ima…

    Submitted 29 March, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

    Comments: Accepted to CVPR 2025. The dataset and code are available at https://github.com/LSXI7/MINIMA

  46. arXiv:2412.13540  [pdf, other]

    cs.CL cs.CV

    Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning

    Authors: Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capac…

    Submitted 17 February, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

  47. arXiv:2412.12643  [pdf, other]

    cs.CL

    LLM-based Discriminative Reasoning for Knowledge Graph Question Answering

    Authors: Mufan Xu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao, Min Zhang

    Abstract: Large language models (LLMs) based on the generative pre-trained Transformer have achieved remarkable performance on knowledge graph question-answering (KGQA) tasks. However, LLMs often produce ungrounded subgraph planning or reasoning results in KGQA due to the hallucinatory behavior brought by the generative paradigm. To tackle this issue, we propose READS to reformulate the KGQA process into discri…

    Submitted 7 March, 2025; v1 submitted 17 December, 2024; originally announced December 2024.

  48. arXiv:2412.12499  [pdf, other]

    cs.CL cs.AI

    LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Reasoning

    Authors: Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang

    Abstract: Large language models (LLMs) have exhibited impressive multilingual reasoning capabilities, driven by extensive multilingual pre-training corpora and instruction fine-tuning data. However, a performance gap exists between high- and low-resource language reasoning tasks due to the language imbalance in the pre-training corpus, which is exacerbated by evaluation bias in existing reasoning benchmarks…

    Submitted 17 February, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

  49. arXiv:2412.10734  [pdf, other]

    cs.CV

    OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving

    Authors: Lianqing Zheng, Long Yang, Qunshu Lin, Wenjin Ai, Minghao Liu, Shouyi Lu, Jianan Liu, Hongze Ren, Jingyue Mo, Xiaokai Bai, Jie Bai, Zhixiong Ma, Xichan Zhu

    Abstract: The rapid advancement of deep learning has intensified the need for comprehensive data for use by autonomous driving algorithms. High-quality datasets are crucial for the development of effective data-driven autonomous driving solutions. Next-generation autonomous driving datasets must be multimodal, incorporating data from advanced sensors that feature extensive data coverage, detailed annotation…

    Submitted 23 January, 2025; v1 submitted 14 December, 2024; originally announced December 2024.

  50. arXiv:2412.05187  [pdf, other]

    cs.AI cs.CV cs.HC cs.RO

    SurgBox: Agent-Driven Operating Room Sandbox with Surgery Copilot

    Authors: Jinlin Wu, Xusheng Liang, Xuexue Bai, Zhen Chen

    Abstract: Surgical interventions, particularly in neurology, represent complex and high-stakes scenarios that impose substantial cognitive burdens on surgical teams. Although deliberate education and practice can enhance cognitive capabilities, surgical training opportunities remain limited due to patient safety concerns. To address these cognitive challenges in surgical training and operation, we propose S…

    Submitted 6 December, 2024; originally announced December 2024.

    Comments: This work is accepted by IEEE Big Data 2024