
Showing 1–50 of 252 results for author: Tian, H

Searching in archive cs.
  1. arXiv:2505.07608  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

    Authors: Xiaomi LLM-Core Team, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, et al. (40 additional authors not shown)

    Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…

    Submitted 12 May, 2025; originally announced May 2025.

  2. arXiv:2504.21312  [pdf, other]

    cs.PL

    Annotating and Auditing the Safety Properties of Unsafe Rust

    Authors: Zihao Rao, Hongliang Tian, Xin Wang, Hui Xu

    Abstract: Unsafe code is a critical topic in ensuring the security of system software development in Rust. It is the sole source of potential undefined behaviors, assuming the compiler is sound. To avoid the misuse of unsafe code, Rust developers should provide clear safety property annotations for unsafe APIs. However, there is limited official guidance and few best practices for annotating unsafe code. Ev…

    Submitted 30 April, 2025; originally announced April 2025.

  3. arXiv:2504.20607  [pdf, other]

    cs.CV

    EfficientHuman: Efficient Training and Reconstruction of Moving Human using Articulated 2D Gaussian

    Authors: Hao Tian, Rui Liu, Wen Shen, Yilong Hu, Zhihao Zheng, Xiaolin Qin

    Abstract: 3D Gaussian Splatting (3DGS) has been recognized as a pioneering technique in scene reconstruction and novel view synthesis. Recent work on reconstructing the 3D human body using 3DGS attempts to leverage prior information on human pose to enhance rendering quality and improve training speed. However, it struggles to effectively fit dynamic surface planes due to multi-view inconsistency and redund…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: 11 pages, 3 figures

  4. arXiv:2504.20584  [pdf, other]

    cs.RO cs.CV

    Hydra: Marker-Free RGB-D Hand-Eye Calibration

    Authors: Martin Huber, Huanyu Tian, Christopher E. Mower, Lucas-Raphael Müller, Sébastien Ourselin, Christos Bergeles, Tom Vercauteren

    Abstract: This work presents an RGB-D imaging-based approach to marker-free hand-eye calibration using a novel implementation of the iterative closest point (ICP) algorithm with a robust point-to-plane (PTP) objective formulated on a Lie algebra. Its applicability is demonstrated through comprehensive experiments using three well known serial manipulators and two RGB-D cameras. With only three randomly chos…

    Submitted 29 April, 2025; originally announced April 2025.

  5. arXiv:2504.13420  [pdf, other]

    cs.RO cs.SE

    Testing the Fault-Tolerance of Multi-Sensor Fusion Perception in Autonomous Driving Systems

    Authors: Haoxiang Tian, Wenqiang Ding, Xingshuo Han, Guoquan Wu, An Guo, Junqi Zhang, Wei Chen, Jun Wei, Tianwei Zhang

    Abstract: High-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on multi-sensor fusion (MSF) based approaches to perceive their surroundings. This strategy increases perception robustness by combining the respective strengths of the camera and LiDAR and directly affects the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world au…

    Submitted 17 April, 2025; originally announced April 2025.

  6. arXiv:2504.10479  [pdf, other]

    cs.CV

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, et al. (26 additional authors not shown)

    Abstract: We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single p…

    Submitted 18 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: Technical Report

  7. arXiv:2504.03053  [pdf, other]

    cs.RO

    Push-Grasp Policy Learning Using Equivariant Models and Grasp Score Optimization

    Authors: Boce Hu, Heng Tian, Dian Wang, Haojie Huang, Xupeng Zhu, Robin Walters, Robert Platt

    Abstract: Goal-conditioned robotic grasping in cluttered environments remains a challenging problem due to occlusions caused by surrounding objects, which prevent direct access to the target object. A promising solution to mitigate this issue is combining pushing and grasping policies, enabling active rearrangement of the scene to facilitate target retrieval. However, existing methods often overlook the ric…

    Submitted 3 April, 2025; originally announced April 2025.

  8. arXiv:2503.23684  [pdf]

    cs.CV

    Detail-aware multi-view stereo network for depth estimation

    Authors: Haitao Tian, Junyang Li, Chenxing Wang, Helong Jiang

    Abstract: Multi-view stereo methods have achieved great success for depth estimation based on the coarse-to-fine depth learning frameworks; however, the existing methods perform poorly in recovering the depth of object boundaries and detail regions. To address these issues, we propose a detail-aware multi-view stereo network (DA-MVSNet) with a coarse-to-fine framework. The geometric depth clues hidden in th…

    Submitted 30 March, 2025; originally announced March 2025.

  9. arXiv:2503.22512  [pdf, other]

    cs.SE

    Unlocking LLM Repair Capabilities in Low-Resource Programming Languages Through Cross-Language Translation and Multi-Agent Refinement

    Authors: Wenqiang Luo, Jacky Wai Keung, Boyang Yang, Jacques Klein, Tegawende F. Bissyande, Haoye Tian, Bach Le

    Abstract: Recent advances in leveraging LLMs for APR have demonstrated impressive capabilities in fixing software defects. However, current LLM-based approaches predominantly focus on mainstream programming languages like Java and Python, neglecting less prevalent but emerging languages such as Rust due to expensive training resources, limited datasets, and insufficient community support. This narrow focus…

    Submitted 17 April, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

  10. arXiv:2503.21710  [pdf, other]

    cs.SE

    Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs

    Authors: Boyang Yang, Haoye Tian, Jiadong Ren, Shunfu Jin, Yang Liu, Feng Liu, Bach Le

    Abstract: Repository-level software repair faces challenges in bridging semantic gaps between issue descriptions and code patches. Existing approaches, which mostly depend on large language models (LLMs), suffer from semantic ambiguities, limited structural context understanding, and insufficient reasoning capability. To address these limitations, we propose KGCompass with two innovations: (1) a novel repos…

    Submitted 27 March, 2025; originally announced March 2025.

  11. arXiv:2503.21412  [pdf, other]

    cs.AI

    Federated Intelligence: When Large AI Models Meet Federated Fine-Tuning and Collaborative Reasoning at the Network Edge

    Authors: Wanli Ni, Haofeng Sun, Huiqing Ao, Hui Tian

    Abstract: Large artificial intelligence (AI) models exhibit remarkable capabilities in various application scenarios, but deploying them at the network edge poses significant challenges due to issues such as data privacy, computational resources, and latency. In this paper, we explore federated fine-tuning and collaborative reasoning techniques to facilitate the implementation of large AI models in resource…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: 8 pages, 6 figures

    Journal ref: IEEE Internet of Things Magazine, 2025

  12. arXiv:2503.10639  [pdf, other]

    cs.CV

    GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

    Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li

    Abstract: Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation a…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Dataset and models are released in https://github.com/rongyaofang/GoT

  13. arXiv:2503.06150  [pdf, other]

    cs.LG

    Do Fairness Interventions Come at the Cost of Privacy: Evaluations for Binary Classifiers

    Authors: Huan Tian, Guangsheng Zhang, Bo Liu, Tianqing Zhu, Ming Ding, Wanlei Zhou

    Abstract: While in-processing fairness approaches show promise in mitigating biased predictions, their potential impact on privacy leakage remains under-explored. We aim to address this gap by assessing the privacy risks of fairness-enhanced binary classifiers via membership inference attacks (MIAs) and attribute inference attacks (AIAs). Surprisingly, our results reveal that enhancing fairness does not nec…

    Submitted 11 March, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

    Comments: Accepted to IEEE Transactions on Dependable and Secure Computing (TDSC)

  14. arXiv:2503.05706  [pdf, other]

    cs.CY stat.AP

    The Impact of Building-Induced Visibility Restrictions on Intersection Accidents

    Authors: Hanlin Tian, Yuxiang Feng, Wei Zhou, Anupriya, Mohammed Quddus, Yiannis Demiris, Panagiotis Angeloudis

    Abstract: Traffic accidents, especially at intersections, are a major road safety concern. Previous research has extensively studied intersection-related accidents, but the effect of building-induced visibility restrictions at intersections on accident rates has been under-explored, particularly in urban contexts. Using OpenStreetMap data, the UK's geographic and accident datasets, and the UK Traffic Count…

    Submitted 13 February, 2025; originally announced March 2025.

    Comments: TRBAM-24-02409

  15. arXiv:2503.03158  [pdf, other]

    cs.SE

    A Systematic Survey on Debugging Techniques for Machine Learning Systems

    Authors: Thanh-Dat Nguyen, Haoye Tian, Bach Le, Patanamon Thongtanunam, Shane McIntosh

    Abstract: Debugging ML software (i.e., the detection, localization and fixing of faults) poses unique challenges compared to traditional software largely due to the probabilistic nature and heterogeneity of its development process. Various methods have been proposed for testing, diagnosing, and repairing ML systems. However, the big picture informing important research directions that really address the dir…

    Submitted 4 March, 2025; originally announced March 2025.

  16. arXiv:2503.00741  [pdf, other]

    eess.IV cs.CV

    LesionDiffusion: Towards Text-controlled General Lesion Synthesis

    Authors: Henrui Tian, Wenhui Lei, Linrui Dai, Hanyu Chen, Xiaofan Zhang

    Abstract: Fully-supervised lesion recognition methods in medical imaging face challenges due to the reliance on large annotated datasets, which are expensive and difficult to collect. To address this, synthetic lesion generation has become a promising approach. However, existing models struggle with scalability, fine-grained control over lesion attributes, and the generation of complex structures. We propos…

    Submitted 18 March, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

    Comments: 10 pages, 4 figures

  17. arXiv:2502.13847  [pdf, other]

    cs.CL cs.AI cs.LG

    DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue

    Authors: Feiyuan Zhang, Dezhi Zhu, James Ming, Yilun Jin, Di Chai, Liu Yang, Han Tian, Zhaoxin Fan, Kai Chen

    Abstract: Retrieval-Augmented Generation (RAG) systems have shown substantial benefits in applications such as question answering and multi-turn dialogue (Lewis et al., 2020). However, traditional RAG methods, while leveraging static knowledge bases, often overlook the potential of dynamic historical information in ongoing conversations. To bridge this gap, we introduce DH-RAG, a Dynamic Historical Co…

    Submitted 19 February, 2025; originally announced February 2025.

  18. Maintenance of Structural Hole Spanners in Dynamic Networks

    Authors: Diksha Goel, Hong Shen, Hui Tian, Mingyu Guo

    Abstract: Structural Hole (SH) spanners are the set of users who bridge different groups of users and are vital in numerous applications. Despite their importance, existing work for identifying SH spanners focuses only on static networks. However, real-world networks are highly dynamic where the underlying structure of the network evolves continuously. Consequently, we study SH spanner problem for dynamic n…

    Submitted 16 February, 2025; originally announced February 2025.

    Comments: 6 pages, 5 figures, published in 2021 IEEE 46th Conference on Local Computer Networks (LCN). arXiv admin note: substantial text overlap with arXiv:2302.13292, arXiv:2310.10667

    MSC Class: 68R10 (Graph Theory)

    Journal ref: 2021 IEEE 46th Conference on Local Computer Networks (LCN), pages 339-342

  19. arXiv:2502.10391  [pdf, other]

    cs.CL cs.CV

    MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

    Authors: Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan

    Abstract: Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhan…

    Submitted 14 February, 2025; originally announced February 2025.

    Comments: Project Page: https://mm-rlhf.github.io/

  20. arXiv:2502.00340  [pdf, other]

    cs.LG cs.CL cs.DC

    Enhancing Token Filtering Efficiency in Large Language Model Training with Collider

    Authors: Di Chai, Pengbo Li, Feiyuan Zhang, Yilun Jin, Han Tian, Junxue Zhang, Kai Chen

    Abstract: Token filtering has been proposed to enhance utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens should reduce computational workloads, existing studies have not succeeded in achieving higher efficiency. This is primarily due to the insufficient sparsity caused by filtering tokens only in the output layers, as well as inefficient…

    Submitted 1 February, 2025; originally announced February 2025.

  21. arXiv:2501.19051  [pdf, other]

    cs.NI

    Swift: Rethinking RDMA Control Plane for Elastic Computing

    Authors: Junxue Zhang, Han Tian, Xinyang Huang, Wenxue Li, Kaiqiang Xu, Dian Shen, Yong Wang, Kai Chen

    Abstract: Elastic computing enables dynamic scaling to meet workload demands, and Remote Direct Memory Access (RDMA) enhances this by providing high-throughput, low-latency network communication. However, integrating RDMA into elastic computing remains a challenge, particularly in control plane operations for RDMA connection setup. This paper revisits the assumptions of prior work on high-performance RDMA…

    Submitted 31 January, 2025; originally announced January 2025.

  22. arXiv:2501.18371  [pdf, other]

    cs.AR cs.CR

    FLASH-FHE: A Heterogeneous Architecture for Fully Homomorphic Encryption Acceleration

    Authors: Junxue Zhang, Xiaodian Cheng, Gang Cao, Meng Dai, Yijun Sun, Han Tian, Dian Shen, Yong Wang, Kai Chen

    Abstract: While many hardware accelerators have recently been proposed to address the inefficiency problem of fully homomorphic encryption (FHE) schemes, none of them is able to deliver optimal performance when facing real-world FHE workloads consisting of a mixture of shallow and deep computations, due primarily to their homogeneous design principle. This paper presents FLASH-FHE, the first FHE accelerat…

    Submitted 30 January, 2025; originally announced January 2025.

  23. arXiv:2501.12796  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    Hybrid Losses for Hierarchical Embedding Learning

    Authors: Haokun Tian, Stefan Lattner, Brian McFee, Charalampos Saitis

    Abstract: In traditional supervised learning, the cross-entropy loss treats all incorrect predictions equally, ignoring the relevance or proximity of wrong labels to the correct answer. By leveraging a tree hierarchy for fine-grained labels, we investigate hybrid losses, such as generalised triplet and cross-entropy losses, to enforce similarity between labels within a multi-task learning framework. We prop…

    Submitted 22 January, 2025; originally announced January 2025.

    Comments: Accepted to ICASSP 2025

  24. arXiv:2501.09367  [pdf, other]

    cs.DC

    PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks

    Authors: Huiyou Zhan, Xuan Zhang, Haisheng Tan, Han Tian, Dongping Yong, Junyang Zhang, Xiang-Yang Li

    Abstract: Large language models (LLMs), while driving a new wave of interactive AI applications across numerous domains, suffer from high inference costs and heavy cloud dependency. Motivated by the redundancy phenomenon in linguistics, we propose a progressive inference paradigm over cloud and edge, i.e., firstly generating the sketch of the answer by LLMs at cloud, and then conducting parallel extension t…

    Submitted 16 January, 2025; originally announced January 2025.

  25. arXiv:2501.04012  [pdf, other]

    cs.MM cs.LG

    FlexCache: Flexible Approximate Cache System for Video Diffusion

    Authors: Desen Sun, Henry Tian, Tim Lu, Sihang Liu

    Abstract: Text-to-Video applications receive increasing attention from the public. Among these, diffusion models have emerged as the most prominent approach, offering impressive quality in visual content generation. However, it still suffers from substantial computational complexity, often requiring several minutes to generate a single video. While prior research has addressed the computational overhead in…

    Submitted 17 December, 2024; originally announced January 2025.

  26. arXiv:2501.03905  [pdf, other]

    cs.NI cs.LG

    mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training

    Authors: Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, Kai Chen

    Abstract: Mixture-of-Expert (MoE) models outperform conventional models by selectively activating different subnets, named experts, on a per-token basis. This gated computation generates dynamic communications that cannot be determined beforehand, challenging the existing GPU interconnects that remain static during the distributed training process. In this paper, we advocate for a first-of-its…

    Submitted 7 January, 2025; originally announced January 2025.

    Comments: Corresponding authors: Z. Zhong, K. Chen

  27. arXiv:2501.00829  [pdf, other]

    cs.NE cs.AI

    An LLM-Empowered Adaptive Evolutionary Algorithm For Multi-Component Deep Learning Systems

    Authors: Haoxiang Tian, Xingshuo Han, Guoquan Wu, An Guo, Yuan Zhou, Jie Zhang, Shuo Li, Jun Wei, Tianwei Zhang

    Abstract: Multi-objective evolutionary algorithms (MOEAs) are widely used for searching optimal solutions in complex multi-component applications. Traditional MOEAs for multi-component deep learning (MCDL) systems face challenges in enhancing the search efficiency while maintaining the diversity. To combat these, this paper proposes μMOEA, the first LLM-empowered adaptive evolutionary search algorithm to…

    Submitted 1 January, 2025; originally announced January 2025.

    Comments: 9

  28. arXiv:2412.19055  [pdf, other]

    cs.CV cs.LG

    SpectralKD: A Unified Framework for Interpreting and Distilling Vision Transformers via Spectral Analysis

    Authors: Huiyuan Tian, Bonan Xu, Shijian Li, Gang Pan

    Abstract: Knowledge Distillation (KD) has achieved widespread success in compressing large Vision Transformers (ViTs), but a unified theoretical framework for both ViTs and KD is still lacking. In this paper, we propose SpectralKD, a novel unified analytical framework that offers deeper insights into ViTs and optimizes KD via spectral analysis. Our model-wise analysis reveals that CaiT concentrates informat…

    Submitted 30 January, 2025; v1 submitted 25 December, 2024; originally announced December 2024.

  29. arXiv:2412.14988  [pdf]

    cs.CV cs.LG

    Stitch Contrast and Segment_Learning a Human Action Segmentation Model Using Trimmed Skeleton Videos

    Authors: Haitao Tian, Pierre Payeur

    Abstract: Existing skeleton-based human action classification models rely on well-trimmed action-specific skeleton videos for both training and testing, precluding their scalability to real-world applications where untrimmed videos exhibiting concatenated actions are predominant. To overcome this limitation, recently introduced skeleton action segmentation models involve un-trimmed skeleton videos into end-…

    Submitted 21 December, 2024; v1 submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted at AAAI 2025

  30. arXiv:2412.09613  [pdf, other]

    cs.CV

    PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

    Authors: Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai

    Abstract: Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is leveraged to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with different token compression strategies, limiting the capabilities of combining images and…

    Submitted 12 December, 2024; originally announced December 2024.

  31. arXiv:2412.09117  [pdf, other]

    cs.RO cs.IT eess.SP

    Reconfigurable Intelligent Surface for Internet of Robotic Things

    Authors: Wanli Ni, Ruyu Luo, Xinran Zhang, Peng Wang, Wen Wang, Hui Tian

    Abstract: With the rapid development of artificial intelligence, robotics, and Internet of Things, multi-robot systems are progressively acquiring human-like environmental perception and understanding capabilities, empowering them to complete complex tasks through autonomous decision-making and interaction. However, the Internet of Robotic Things (IoRT) faces significant challenges in terms of spectrum reso…

    Submitted 12 December, 2024; originally announced December 2024.

    Comments: 9 pages, 4 figures

    Journal ref: IEEE Internet of Things Magazine, 2025

  32. An End-to-End Collaborative Learning Approach for Connected Autonomous Vehicles in Occluded Scenarios

    Authors: Leandro Parada, Hanlin Tian, Jose Escribano, Panagiotis Angeloudis

    Abstract: Collaborative navigation becomes essential in situations of occluded scenarios in autonomous driving where independent driving policies are likely to lead to collisions. One promising approach to address this issue is through the use of Vehicle-to-Vehicle (V2V) networks that allow for the sharing of perception information with nearby agents, preventing catastrophic accidents. In this article, we p…

    Submitted 11 December, 2024; originally announced December 2024.

    Journal ref: 2023 IEEE 26th International Conference on Intelligent Transportation Systems, pp. 5548-5554, 2023

  33. arXiv:2412.08032  [pdf, other]

    cs.CE

    Energy-Efficient Robust Beamforming for Multi-Functional RIS-Aided Wireless Communication under Imperfect CSI

    Authors: Ailing Zheng, Wanli Ni, Wen Wang, Hui Tian, Chau Yuen

    Abstract: The robust beamforming design in multi-functional reconfigurable intelligent surface (MF-RIS) assisted wireless networks is investigated in this work, where the MF-RIS supports signal reflection, refraction, and amplification to address the double-fading attenuation and half-space coverage issues faced by traditional RISs. Specifically, we aim to maximize the system energy efficiency by jointly op…

    Submitted 10 December, 2024; originally announced December 2024.

    Comments: 15 pages, 6 figures; accepted by IEEE Transactions on Communications

  34. arXiv:2412.05271  [pdf, other]

    cs.CV

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Authors: Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, et al. (17 additional authors not shown)

    Abstract: We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision…

    Submitted 13 January, 2025; v1 submitted 6 December, 2024; originally announced December 2024.

    Comments: Technical Report

  35. arXiv:2412.03924  [pdf, ps, other]

    cs.CV

    Privacy-Preserving in Medical Image Analysis: A Review of Methods and Applications

    Authors: Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew, Hui Tian

    Abstract: With the rapid advancement of artificial intelligence and deep learning, medical image analysis has become a critical tool in modern healthcare, significantly improving diagnostic accuracy and efficiency. However, AI-based methods also raise serious privacy concerns, as medical images often contain highly sensitive patient information. This review offers a comprehensive overview of privacy-preserv…

    Submitted 5 December, 2024; originally announced December 2024.

  36. arXiv:2412.01072  [pdf, other]

    cs.SE

    When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair

    Authors: Wenqiang Luo, Jacky Wai Keung, Boyang Yang, He Ye, Claire Le Goues, Tegawende F. Bissyande, Haoye Tian, Bach Le

    Abstract: Software systems have been evolving rapidly and inevitably introducing bugs at an increasing rate, leading to significant losses in resources consumed by software maintenance. Recently, large language models (LLMs) have demonstrated remarkable potential in enhancing software development and maintenance practices, particularly in automated program repair (APR) with improved accuracy and efficiency…

    Submitted 1 December, 2024; originally announced December 2024.

  37. arXiv:2412.00744  [pdf, other]

    cs.RO cs.AI

    A Cross-Scene Benchmark for Open-World Drone Active Tracking

    Authors: Haowei Sun, Jinwu Hu, Zhirui Zhang, Haoyuan Tian, Xinze Xie, Yufeng Wang, Zhuliang Yu, Xiaohua Xie, Mingkui Tan

    Abstract: Drone Visual Active Tracking aims to autonomously follow a target object by controlling the motion system based on visual observations, providing a more practical solution for effective tracking in dynamic environments. However, accurate Drone Visual Active Tracking using reinforcement learning remains challenging due to the absence of a unified benchmark, the complexity of open-world environments…

    Submitted 1 December, 2024; originally announced December 2024.

    Comments: 25 pages

  38. arXiv:2410.18107  [pdf, other]

    cs.SE cs.AI

    In-Context Code-Text Learning for Bimodal Software Engineering

    Authors: Xunzhu Tang, Liran Wang, Yonghui Liu, Linzheng Chai, Jian Yang, Zhoujun Li, Haoye Tian, Jacques Klein, Tegawende F. Bissyande

    Abstract: Bimodal software analysis initially appeared to be within reach with the advent of large language models. Unfortunately, the complex interplay of natural language text and code in software engineering presents unique challenges that prevent pretrained models from generalizing to a variety of tasks. We postulate that in-context learning for the code-text bimodality is a promising avenue. This paper th…

    Submitted 8 October, 2024; originally announced October 2024.

  39. arXiv:2410.16261  [pdf, other]

    cs.CV

    Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

    Authors: Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang

    Abstract: Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-Inter…

    Submitted 7 November, 2024; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: Technical report

  40. arXiv:2410.13861  [pdf, other]

    cs.CV

    PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

    Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu

    Abstract: Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing works have insufficiently addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm -…

    Submitted 21 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: Project page: https://rongyaofang.github.io/puma/

  41. arXiv:2410.12474  [pdf, other]

    cs.CV cs.LG

    Mind the Gap Between Prototypes and Images in Cross-domain Finetuning

    Authors: Hongduan Tian, Feng Liu, Zhanke Zhou, Tongliang Liu, Chengqi Zhang, Bo Han

    Abstract: In cross-domain few-shot classification (CFC), recent works mainly focus on adapting a simple transformation head on top of a frozen pre-trained backbone with few labeled data to project embeddings into a task-specific metric space where classification can be performed by measuring similarities between image instance and prototype representations. Technically, an assumption implicitly adopted in s…

    Submitted 20 October, 2024; v1 submitted 16 October, 2024; originally announced October 2024.

  42. arXiv:2410.07407  [pdf, other]

    cs.AR

    Optimized Spatial Architecture Mapping Flow for Transformer Accelerators

    Authors: Haocheng Xu, Faraz Tahmasebi, Ye Qiao, Hongzheng Tian, Hyoukjun Kwon, Sitao Huang

    Abstract: Recent innovations in Transformer-based large language models have significantly advanced the field of general-purpose neural language understanding and generation. With billions of trainable parameters, deployment of these large models relies on high-performance hardware accelerators to efficiently deliver the required computation. Spatial architectures, such as TPUs, offer a promising solution t…

    Submitted 9 October, 2024; originally announced October 2024.

  43. SoVAR: Building Generalizable Scenarios from Accident Reports for Autonomous Driving Testing

    Authors: An Guo, Yuan Zhou, Haoxiang Tian, Chunrong Fang, Yunjian Sun, Weisong Sun, Xinyu Gao, Anh Tuan Luu, Yang Liu, Zhenyu Chen

    Abstract: Autonomous driving systems (ADSs) have undergone remarkable development and are increasingly employed in safety-critical applications. However, recently reported data on fatal accidents involving ADSs suggests that the desired level of safety has not yet been fully achieved. Consequently, there is a growing need for more comprehensive and targeted testing approaches to ensure safe driving. Scenari…

    Submitted 12 September, 2024; originally announced September 2024.

    Journal ref: 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24), October 27-November 1, 2024, Sacramento, CA, USA

  44. arXiv:2408.16886  [pdf, other]

    eess.IV cs.CV

    LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation

    Authors: Juntao Jiang, Mengmeng Wang, Huizhong Tian, Lingbo Cheng, Yong Liu

    Abstract: While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deplo…

    Submitted 2 December, 2024; v1 submitted 29 August, 2024; originally announced August 2024.

    Comments: Accepted by IEEE BIBM2024 ML4BMI workshop

  45. arXiv:2408.13257  [pdf, other]

    cs.CV

    MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    Authors: Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin, Tieniu Tan

    Abstract: Comprehensive evaluation of Multimodal Large Language Models (MLLMs) has recently garnered widespread attention in the research community. However, we observe that existing benchmarks present several common barriers that make it difficult to measure the significant challenges that models face in the real world, including: 1) small data scale leads to a large performance variance; 2) reliance on mo…

    Submitted 5 February, 2025; v1 submitted 23 August, 2024; originally announced August 2024.

    Comments: Project Page: https://mme-realworld.github.io/; accepted by ICLR 2025

  46. arXiv:2408.12526  [pdf, other]

    cs.LG

    Exploiting Student Parallelism for Efficient GPU Inference of BERT-like Models in Online Services

    Authors: Weiyan Wang, Yilun Jin, Yiming Zhang, Victor Junqiu Wei, Han Tian, Li Chen, Jinbao Xue, Yangyu Tao, Di Wang, Kai Chen

    Abstract: Due to high accuracy, BERT-like models have been widely adopted by text mining and web searching. However, large BERT-like models suffer from inefficient online inference, facing the following two problems on GPUs: (1) their high accuracy relies on the large model depth, which linearly increases the sequential computation on GPUs; (2) stochastic and dynamic online workloads cause extra costs from…

    Submitted 18 March, 2025; v1 submitted 22 August, 2024; originally announced August 2024.

  47. arXiv:2408.05090  [pdf, other]

    cs.CV cs.MM

    Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation

    Authors: Huilin Tian, Jingke Meng, Wei-Shi Zheng, Yuan-Ming Li, Junkai Yan, Yunong Zhang

    Abstract: Vision and Language Navigation (VLN) is a challenging task that requires agents to understand instructions and navigate to the destination in a visual environment. One of the key challenges in outdoor VLN is keeping track of which part of the instruction was completed. To alleviate this problem, previous works mainly focus on grounding the natural language to the visual input, but neglecting the cr…

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: arXiv admin note: text overlap with arXiv:2203.13838 by other authors

  48. arXiv:2408.02718  [pdf, other]

    cs.CV

    MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

    Authors: Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

    Abstract: The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not kept pace with their development. To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluatio…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Project Page: https://mmiu-bench.github.io/

  49. arXiv:2407.21570  [pdf]

    cs.RO

    Vision and Contact based Optimal Control for Autonomous Trocar Docking

    Authors: Christopher E. Mower, Martin Huber, Huanyu Tian, Ayoob Davoodi, Emmanuel Vander Poorten, Tom Vercauteren, Christos Bergeles

    Abstract: Future operating theatres will be equipped with robots to perform various surgical tasks including, for example, endoscope control. Human-in-the-loop supervisory control architectures where the surgeon selects from several autonomous sequences are already being successfully applied in preclinical tests. Inserting an endoscope into a trocar or introducer is a key step for every keyhole surgical proc…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: Presented at the 12th Conference on New Technologies for Computer and Robot Assisted Surgery

  50. MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

    Authors: Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai

    Abstract: Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of Vision Large Language Models (VLLMs), existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: despite existing VLLMs exhibiting strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, s…

    Submitted 7 August, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

    Comments: 18 pages, 8 figures, technical report

    Report number: 67

    Journal ref: Sci China Inf Sci, 2024