Showing 1–50 of 357 results for author: Cao, H

Searching in archive cs.
  1. arXiv:2505.10473  [pdf, ps, other]

    cs.CV

    Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

    Authors: Fengdi Zhang, Hongkun Cao, Ruqi Huang

    Abstract: To reduce storage and computational costs, 3D Gaussian splatting (3DGS) seeks to minimize the number of Gaussians used while preserving high rendering quality, introducing an inherent trade-off between Gaussian quantity and rendering quality. Existing methods strive for better quantity-quality performance, but lack the ability for users to intuitively adjust this trade-off to suit practical needs…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.03739  [pdf, other]

    cs.CL cs.AI

    VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

    Authors: Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun

    Abstract: With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-A…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Training and Inference Codes: https://github.com/VITA-MLLM/VITA-Audio

  3. arXiv:2504.19249  [pdf, other]

    cs.CV

    ODExAI: A Comprehensive Object Detection Explainable AI Evaluation

    Authors: Loc Phuc Truong Nguyen, Hung Truong Thanh Nguyen, Hung Cao

    Abstract: Explainable Artificial Intelligence (XAI) techniques for interpreting object detection models remain in an early stage, with no established standards for systematic evaluation. This absence of consensus hinders both the comparative analysis of methods and the informed selection of suitable approaches. To address this gap, we introduce the Object Detection Explainable AI Evaluation (ODExAI), a comp…

    Submitted 27 April, 2025; originally announced April 2025.

  4. arXiv:2504.19104  [pdf, other]

    cs.RO

    MISO: Multiresolution Submap Optimization for Efficient Globally Consistent Neural Implicit Reconstruction

    Authors: Yulun Tian, Hanwen Cao, Sunghwan Kim, Nikolay Atanasov

    Abstract: Neural implicit representations have had a significant impact on simultaneous localization and mapping (SLAM) by enabling robots to build continuous, differentiable, and high-fidelity 3D maps from sensor data. However, as the scale and complexity of the environment increase, neural SLAM approaches face renewed challenges in the back-end optimization process to keep up with runtime requirements and…

    Submitted 27 April, 2025; originally announced April 2025.

    Comments: To appear at RSS 2025 (15 pages, 11 figures)

  5. arXiv:2504.12314  [pdf, other]

    cs.CL cs.AI

    How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension

    Authors: Hao Li, Liuzhenghao Lv, He Cao, Zijing Liu, Zhiyuan Yan, Yu Wang, Yonghong Tian, Yu Li, Li Yuan

    Abstract: Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge shortcut phenomenon observed in th…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: 17 pages

  6. arXiv:2504.09196  [pdf, other]

    cs.CV

    RT-DATR:Real-time Unsupervised Domain Adaptive Detection Transformer with Adversarial Feature Learning

    Authors: Feng Lv, Chunlong Xia, Shuo Wang, Huo Cao

    Abstract: Although domain-adaptive object detectors based on CNN and transformers have made significant progress in cross-domain detection tasks, it is regrettable that domain adaptation for real-time transformer-based detectors has not yet been explored. Directly applying existing domain adaptation algorithms has proven to be suboptimal. In this paper, we propose RT-DATR, a simple and efficient real-time do…

    Submitted 12 April, 2025; originally announced April 2025.

  7. arXiv:2504.09185  [pdf, ps, other]

    cs.LG cs.AI

    Repetitive Contrastive Learning Enhances Mamba's Selectivity in Time Series Prediction

    Authors: Wenbo Yan, Hanzhong Cao, Ying Tan

    Abstract: Long sequence prediction is a key challenge in time series forecasting. While Mamba-based models have shown strong performance due to their sequence selection capabilities, they still struggle with insufficient focus on critical time steps and incomplete noise suppression, caused by limited selective abilities. To address this, we introduce Repetitive Contrastive Learning (RCL), a token-level cont…

    Submitted 12 April, 2025; originally announced April 2025.

  8. arXiv:2504.02867  [pdf, other]

    cs.CL cs.AI

    Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications

    Authors: Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas

    Abstract: Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for robust evaluation methodologies to accurately assess LLM-based applications. Traditional evaluation methods, which rely on word overlap or text embeddings, are inad…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Presented at SophiaSummit2024

  9. BiSeg-SAM: Weakly-Supervised Post-Processing Framework for Boosting Binary Segmentation in Segment Anything Models

    Authors: Encheng Su, Hu Cao, Alois Knoll

    Abstract: Accurate segmentation of polyps and skin lesions is essential for diagnosing colorectal and skin cancers. While various segmentation methods for polyps and skin lesions using fully supervised deep learning techniques have been developed, the pixel-level annotation of medical images by doctors is both time-consuming and costly. Foundational vision models like the Segment Anything Model (SAM) have d…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

  10. arXiv:2504.00584  [pdf, other]

    cs.CL cs.AI

    Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach

    Authors: Hongliu Cao

    Abstract: Negation plays an important role in various natural language processing tasks such as Natural Language Inference and Sentiment Analysis tasks. Numerous prior studies have found that contextual text embedding models such as BERT, ELMO, RoBERTa or XLNet face challenges in accurately understanding negation. Recent advancements in universal text embeddings have demonstrated superior performance over c…

    Submitted 1 April, 2025; originally announced April 2025.

  11. arXiv:2503.22524  [pdf, other]

    cs.RO cs.AI

    Robust Offline Imitation Learning Through State-level Trajectory Stitching

    Authors: Shuze Wang, Yunpeng Mei, Hongjie Cao, Yetian Yuan, Gang Wang, Jian Sun, Jie Chen

    Abstract: Imitation learning (IL) has proven effective for enabling robots to acquire visuomotor skills through expert demonstrations. However, traditional IL methods are limited by their reliance on high-quality, often scarce, expert data, and suffer from covariate shift. To address these challenges, recent advances in offline IL have incorporated suboptimal, unlabeled datasets into the training. In this p…

    Submitted 28 March, 2025; originally announced March 2025.

  12. arXiv:2503.21307  [pdf, other]

    cs.CV cs.AI

    InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression

    Authors: Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, Huo Cao

    Abstract: Most multimodal large language models (MLLMs) treat visual tokens as "a sequence of text", integrating them with text tokens into a large language model (LLM). However, a great quantity of visual tokens significantly increases the demand for computational resources and time. In this paper, we propose InternVL-X, which outperforms the InternVL model in both performance and efficiency by incorporati…

    Submitted 27 March, 2025; originally announced March 2025.

  13. arXiv:2503.20285  [pdf, other]

    cs.LG cs.AI

    Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation

    Authors: Hongye Cao, Fan Feng, Jing Huo, Shangdong Yang, Meng Fang, Tianpei Yang, Yang Gao

    Abstract: Model-based offline Reinforcement Learning (RL) constructs environment models from offline datasets to perform conservative policy optimization. Existing approaches focus on learning state transitions through ensemble models, rollouting conservative estimation to mitigate extrapolation errors. However, the static data makes it challenging to develop a robust policy, and offline agents cannot acces…

    Submitted 26 March, 2025; originally announced March 2025.

  14. arXiv:2503.18454  [pdf, other]

    cs.CV cs.LG

    InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

    Authors: Yunhong Lu, Qichao Wang, Hengyuan Cao, Xierui Wang, Xiaoyin Xu, Min Zhang

    Abstract: Without using explicit reward, direct preference optimization (DPO) employs paired human preference data to fine-tune generative models, a method that has garnered considerable attention in large language models (LLMs). However, exploration of aligning text-to-image (T2I) diffusion models with human preferences remains limited. In comparison to supervised fine-tuning, existing methods that align d…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025

  15. arXiv:2503.12095  [pdf]

    cs.CV

    Towards Vision Zero: The Accid3nD Dataset

    Authors: Walter Zimmer, Ross Greer, Daniel Lehmberg, Marc Pavel, Holger Caesar, Xingcheng Zhou, Ahmed Ghita, Mohan Trivedi, Rui Song, Hu Cao, Akshay Gopalkrishnan, Alois C. Knoll

    Abstract: Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as unavoidable and sporadic outcomes of traffic networks. No public dataset contains 3D annotations of real-world accidents recorded from roadside sensors. We present the Accid3nD dataset, a collection of real-world highway accidents in…

    Submitted 15 March, 2025; originally announced March 2025.

  16. arXiv:2503.11012  [pdf, other]

    cs.RO

    Robotic Sim-to-Real Transfer for Long-Horizon Pick-and-Place Tasks in the Robotic Sim2Real Competition

    Authors: Ming Yang, Hongyu Cao, Lixuan Zhao, Chenrui Zhang, Yaran Chen

    Abstract: This paper presents a fully autonomous robotic system that performs sim-to-real transfer in complex long-horizon tasks involving navigation, recognition, grasping, and stacking in an environment with multiple obstacles. The key feature of the system is the ability to overcome typical sensing and actuation discrepancies during sim-to-real transfer and to achieve consistent performance without any…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 7 pages, 7 figures, accepted for presentation at ICRA 2025. The final version will be available in IEEE Xplore

  17. arXiv:2503.08495  [pdf, other]

    cs.CL

    Enhancing Multi-Hop Fact Verification with Structured Knowledge-Augmented Large Language Models

    Authors: Han Cao, Lingwei Wei, Wei Zhou, Songlin Hu

    Abstract: The rapid development of social platforms exacerbates the dissemination of misinformation, which stimulates the research in fact verification. Recent studies tend to leverage semantic features to solve this problem as a single-hop task. However, the process of verifying a claim requires several pieces of evidence with complicated inner logic and relations to verify the given claim in real-world si…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Accepted by AAAI 2025

  18. arXiv:2503.06744  [pdf, other]

    cs.CV

    CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving

    Authors: Rui Song, Chenwei Liang, Yan Xia, Walter Zimmer, Hu Cao, Holger Caesar, Andreas Festag, Alois Knoll

    Abstract: Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, whi…

    Submitted 9 March, 2025; originally announced March 2025.

  19. arXiv:2503.04794  [pdf, other]

    cs.RO

    Runtime Learning of Quadruped Robots in Wild Environments

    Authors: Yihao Cai, Yanbing Mao, Lui Sha, Hongpeng Cao, Marco Caccamo

    Abstract: This paper presents a runtime learning framework for quadruped robots, enabling them to learn and adapt safely in dynamic wild environments. The framework integrates sensing, navigation, and control, forming a closed-loop system for the robot. The core novelty of this framework lies in two interactive and complementary components within the control module: the high-performance (HP)-Student and the…

    Submitted 1 March, 2025; originally announced March 2025.

  20. arXiv:2503.02408  [pdf, other]

    cs.RO

    Predictive Kinematic Coordinate Control for Aerial Manipulators based on Modified Kinematics Learning

    Authors: Zhengzhen Li, Jiahao Shen, Mengyu Ji, Huazi Cao, Shiyu Zhao

    Abstract: High-precision manipulation has always been a developmental goal for aerial manipulators. This paper investigates the kinematic coordinate control issue in aerial manipulators. We propose a predictive kinematic coordinate control method, which includes a learning-based modified kinematic model and a model predictive control (MPC) scheme based on weight allocation. Compared to existing methods, our…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: accepted by ICRA 2025

  21. arXiv:2503.02172  [pdf, other]

    cs.AI cs.SE

    KGCompiler: Deep Learning Compilation Optimization for Knowledge Graph Complex Logical Query Answering

    Authors: Hongyu Lin, Haoran Luo, Hanghang Cao, Yang Liu, Shihao Gao, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu

    Abstract: Complex Logical Query Answering (CLQA) involves intricate multi-hop logical reasoning over large-scale and potentially incomplete Knowledge Graphs (KGs). Although existing CLQA algorithms achieve high accuracy in answering such queries, their reasoning time and memory usage scale significantly with the number of First-Order Logic (FOL) operators involved, creating serious challenges for practical…

    Submitted 3 March, 2025; originally announced March 2025.

  22. arXiv:2502.17295  [pdf, other]

    cs.HC cs.RO

    Co-Designing Augmented Reality Tools for High-Stakes Clinical Teamwork

    Authors: Angelique Taylor, Tauhid Tanjim, Huajie Cao, Jalynn Blu Nicoly, Jonathan I. Segal, Jonathan St. George, Soyon Kim, Kevin Ching, Francisco R. Ortega, Hee Rin Lee

    Abstract: How might healthcare workers (HCWs) leverage augmented reality head-mounted displays (AR-HMDs) to enhance teamwork? Although AR-HMDs have shown immense promise in supporting teamwork in healthcare settings, design for Emergency Department (ER) teams has received little attention. The ER presents unique challenges, including procedural recall, medical errors, and communication gaps. To address this…

    Submitted 24 February, 2025; originally announced February 2025.

    Comments: 19 pages, 7 figures, submitted to DIS 2025

    MSC Class: H.5.1; J.3

  23. arXiv:2502.16587  [pdf, other]

    cs.RO

    Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

    Authors: Sicheng Xie, Haidong Cao, Zejia Weng, Zhen Xing, Shiwei Shen, Jiaqi Leng, Xipeng Qiu, Yanwei Fu, Zuxuan Wu, Yu-Gang Jiang

    Abstract: Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing work often overlooks the differences between humans and robots, producing unsatisfactory results. In this paper, we study how perfectly aligned human-robot pairs benefit robot learning. Capitalizing on VR-based teleportation, we introduce H&R, a third-person dataset with 2,600 episodes, each of…

    Submitted 4 April, 2025; v1 submitted 23 February, 2025; originally announced February 2025.

  24. arXiv:2502.14344  [pdf, other]

    cs.CV

    Towards Accurate Binary Spiking Neural Networks: Learning with Adaptive Gradient Modulation Mechanism

    Authors: Yu Liang, Wenjie Wei, Ammar Belatreche, Honglin Cao, Zijian Zhou, Shuai Wang, Malu Zhang, Yang Yang

    Abstract: Binary Spiking Neural Networks (BSNNs) inherit the event-driven paradigm of SNNs, while also adopting the reduced storage burden of binarization techniques. These distinct advantages grant BSNNs lightweight and energy-efficient characteristics, rendering them ideal for deployment on resource-constrained edge devices. However, due to the binary synaptic weights and non-differentiable spike function,…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: 9 pages, 8 figures, AAAI conference

  25. arXiv:2502.14326  [pdf, other]

    cs.CR

    Browser Fingerprint Detection and Anti-Tracking

    Authors: Kaitong Lin, Huazhu Cao, Amin Milani Fard

    Abstract: Digital fingerprints have brought great convenience and benefits to many online businesses. However, they pose a significant threat to the privacy and security of ordinary users. In this paper, we investigate the effectiveness of current anti-tracking methods against digital fingerprints and design a browser extension that can effectively resist digital fingerprints and record the website's collec…

    Submitted 20 February, 2025; originally announced February 2025.

  26. arXiv:2502.11090  [pdf, other]

    cs.CL cs.AI

    SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

    Authors: Hongye Cao, Yanming Wang, Sijia Jing, Ziyue Peng, Zhixin Bai, Zhe Cao, Meng Fang, Fan Feng, Boyan Wang, Jiaheng Liu, Tianpei Yang, Jing Huo, Yang Gao, Fanyu Meng, Xi Yang, Chao Deng, Junlan Feng

    Abstract: With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess the safety. Additionally, these benchmarks have not taken into account the LLM's capability of identifying and handling unsafe information in detail. T…

    Submitted 17 February, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

  27. arXiv:2502.10097  [pdf, other]

    cs.AI cs.LG

    Causal Information Prioritization for Efficient Reinforcement Learning

    Authors: Hongye Cao, Fan Feng, Tianpei Yang, Jing Huo, Yang Gao

    Abstract: Current Reinforcement Learning (RL) methods often suffer from sample-inefficiency, resulting from blind exploration strategies that neglect causal relationships among states, actions, and rewards. Although recent causal approaches aim to address this problem, they lack grounded modeling of reward-guided causal understanding of states and actions for goal-orientation, thus impairing learning effici…

    Submitted 14 February, 2025; originally announced February 2025.

  28. arXiv:2502.10077  [pdf, other]

    cs.AI cs.LG

    Towards Empowerment Gain through Causal Structure Learning in Model-Based RL

    Authors: Hongye Cao, Fan Feng, Meng Fang, Shaokang Dong, Tianpei Yang, Jing Huo, Yang Gao

    Abstract: In Model-Based Reinforcement Learning (MBRL), incorporating causal structures into dynamics models provides agents with a structured understanding of the environments, enabling efficient decision. Empowerment as an intrinsic motivation enhances the ability of agents to actively control their environments by maximizing the mutual information between future states and actions. We posit that empowerm…

    Submitted 14 February, 2025; originally announced February 2025.

  29. arXiv:2502.09747  [pdf, other]

    cs.CL

    The Widespread Adoption of Large Language Model-Assisted Writing Across Society

    Authors: Weixin Liang, Yaohui Zhang, Mihai Codreanu, Jiayu Wang, Hancheng Cao, James Zou

    Abstract: The recent advances in large language models (LLMs) attracted significant public and policymaker interest in its adoption patterns. In this paper, we systematically analyze LLM-assisted writing across four domains-consumer complaints, corporate communications, job postings, and international organization press releases-from January 2022 to September 2024. Our dataset includes 687,241 consumer comp…

    Submitted 17 February, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

  30. arXiv:2502.07384   

    cs.CE

    SAGEPhos: Sage Bio-Coupled and Augmented Fusion for Phosphorylation Site Detection

    Authors: Jingjie Zhang, Hanqun Cao, Zijun Gao, Xiaorui Wang, Chunbin Gu

    Abstract: Phosphorylation site prediction based on kinase-substrate interaction plays a vital role in understanding cellular signaling pathways and disease mechanisms. Computational methods for this task can be categorized into kinase-family-focused and individual kinase-targeted approaches. Individual kinase-targeted methods have gained prominence for their ability to explore a broader protein space and pr…

    Submitted 16 April, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: Due to significant disagreements within the author team regarding the content of the paper and an inability to reach a consensus, we have decided to withdraw the current version to allow for further verification and refinement of the research content

  31. arXiv:2502.06887  [pdf, ps, other]

    cs.LG cs.AI

    Gradient Based Method for the Fusion of Lattice Quantizers

    Authors: Liyuan Zhang, Hanzhong Cao, Jiaheng Li, Minyang Yu

    Abstract: In practical applications, lattice quantizers leverage discrete lattice points to approximate arbitrary points in the lattice. An effective lattice quantizer significantly enhances both the accuracy and efficiency of these approximations. In the context of high-dimensional lattice quantization, previous work proposed utilizing low-dimensional optimal lattice quantizers and addressed the challenge…

    Submitted 9 February, 2025; originally announced February 2025.

  32. arXiv:2502.05905  [pdf, other]

    cs.CV

    QP-SNN: Quantized and Pruned Spiking Neural Networks

    Authors: Wenjie Wei, Malu Zhang, Zijian Zhou, Ammar Belatreche, Yimeng Shan, Yu Liang, Honglin Cao, Jieyuan Zhang, Yang Yang

    Abstract: Brain-inspired Spiking Neural Networks (SNNs) leverage sparse spikes to encode information and operate in an asynchronous event-driven manner, offering a highly energy-efficient paradigm for machine intelligence. However, the current SNN community focuses primarily on performance improvement by developing large-scale models, which limits the applicability of SNNs in resource-limited edge devices.…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: 26 pages, 17 figures, Published as a conference paper at ICLR 2025

  33. arXiv:2502.05177  [pdf, ps, other]

    cs.CV

    Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

    Authors: Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Ran He, Caifeng Shan, Rongrong Ji, Xing Sun

    Abstract: We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performances on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large languag…

    Submitted 18 February, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

    Comments: https://github.com/VITA-MLLM/Long-VITA

  34. arXiv:2502.02449  [pdf, other]

    cs.CV

    TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes

    Authors: Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, Alois C. Knoll

    Abstract: We present TUMTraffic-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating…

    Submitted 4 February, 2025; originally announced February 2025.

  35. arXiv:2501.14249  [pdf, other]

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  36. arXiv:2501.13492  [pdf, other]

    cs.CV

    Quantized Spike-driven Transformer

    Authors: Xuerui Qiu, Malu Zhang, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Haizhou Li

    Abstract: Spiking neural networks are emerging as a promising energy-efficient alternative to traditional artificial neural networks due to their spike-driven paradigm. However, recent research in the SNN domain has mainly focused on enhancing accuracy by designing large-scale Transformer structures, which typically rely on substantial computational resources, limiting their deployment on resource-constrain…

    Submitted 23 March, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

    Comments: Accepted by ICLR 2025

  37. arXiv:2501.10325  [pdf, other]

    cs.CV

    DiffStereo: High-Frequency Aware Diffusion Model for Stereo Image Restoration

    Authors: Huiyun Cao, Yuan Shi, Bin Xia, Xiaoyu Jin, Wenming Yang

    Abstract: Diffusion models (DMs) have achieved promising performance in image restoration but haven't been explored for stereo images. The application of DM in stereo image restoration is confronted with a series of challenges. The need to reconstruct two images exacerbates DM's computational cost. Additionally, existing latent DMs usually focus on semantic information and remove high-frequency details as r…

    Submitted 17 January, 2025; originally announced January 2025.

    Comments: 9 pages, 6 figures

  38. arXiv:2501.05934  [pdf, other]

    cs.LG cs.DC

    Encoded Spatial Attribute in Multi-Tier Federated Learning

    Authors: Asfia Kawnine, Francis Palma, Seyed Alireza Rahimi Azghadi, Hung Cao

    Abstract: This research presents an Encoded Spatial Multi-Tier Federated Learning approach for a comprehensive evaluation of aggregated models for geospatial data. In the client tier, encoding spatial information is introduced to better predict the target outcome. The research aims to assess the performance of these models across diverse datasets and spatial attributes, highlighting variations in predictive…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: IEEE ICCE 2025

  39. arXiv:2501.05904  [pdf, other]

    cs.CV

    Binary Event-Driven Spiking Transformer

    Authors: Honglin Cao, Zijian Zhou, Wenjie Wei, Ammar Belatreche, Yu Liang, Dehao Zhang, Malu Zhang, Yang Yang, Haizhou Li

    Abstract: Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm that combines the high performance of Transformers with the energy efficiency of SNNs. However, the larger model size and increased computational demands of the Transformer structure limit their practicality in resource-constrained scenarios. In this paper, we integrate binarization techniques i…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: 11 pages, 5 figures

  40. arXiv:2501.02450  [pdf, other]

    cs.CV

    GCP: Guarded Collaborative Perception with Spatial-Temporal Aware Malicious Agent Detection

    Authors: Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, Yuguang Fang

    Abstract: Collaborative perception significantly enhances autonomous driving safety by extending each vehicle's perception range through message sharing among connected and autonomous vehicles. Unfortunately, it is also vulnerable to adversarial message attacks from malicious agents, resulting in severe performance degradation. While existing defenses employ hypothesis-and-verification frameworks to detect…

    Submitted 5 January, 2025; originally announced January 2025.

    Comments: 15 pages

  41. arXiv:2501.01957  [pdf, other]

    cs.CV cs.SD eess.AS

    VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He

    Abstract: Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and implementing high-performance in both vision and speech tasks remains a significant challenge due to the fundamental modality difference…

    Submitted 21 January, 2025; v1 submitted 3 January, 2025; originally announced January 2025.

    Comments: https://github.com/VITA-MLLM/VITA (2K+ Stars by now)

  42. arXiv:2412.16948  [pdf]

    cs.CV

    DTSGAN: Learning Dynamic Textures via Spatiotemporal Generative Adversarial Network

    Authors: Xiangtian Li, Xiaobo Wang, Zhen Qi, Han Cao, Zhaoyang Zhang, Ao Xiang

    Abstract: Dynamic texture synthesis aims to generate sequences that are visually similar to a reference video texture and exhibit specific stationary properties in time. In this paper, we introduce a spatiotemporal generative adversarial network (DTSGAN) that can learn from a single dynamic texture by capturing its motion and content distribution. With the pipeline of DTSGAN, a new video sequence is generat…

    Submitted 22 December, 2024; originally announced December 2024.

  43. arXiv:2412.15803  [pdf, other]

    cs.LG cs.AI

    WebLLM: A High-Performance In-Browser LLM Inference Engine

    Authors: Charlie F. Ruan, Yucheng Qin, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen

    Abstract: Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices have made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provi…

    Submitted 20 December, 2024; originally announced December 2024.

  44. arXiv:2412.15564  [pdf, other]

    cs.GR cs.CG

    Robust and Feature-Preserving Offset Meshing

    Authors: Hongyi Cao, Gang Xu, Renshu Gu, Jinlan Xu, Xiaoyu Zhang, Timon Rabczuk, Yuzhe Luo, Xifeng Gao

    Abstract: We introduce a novel offset meshing approach that can robustly handle a 3D surface mesh with an arbitrary geometry and topology configurations, while nicely capturing the sharp features on the original input for both inward and outward offsets. Compared to the existing approaches focusing on constant-radius offset, to the best of our knowledge, we propose the first-ever solution for mitered offset…

    Submitted 19 December, 2024; originally announced December 2024.

  45. arXiv:2412.13224  [pdf, other]

    cs.RO cs.AI cs.LG

    Physics-model-guided Worst-case Sampling for Safe Reinforcement Learning

    Authors: Hongpeng Cao, Yanbing Mao, Lui Sha, Marco Caccamo

    Abstract: Real-world accidents in learning-enabled CPS frequently occur in challenging corner cases. During the training of deep reinforcement learning (DRL) policy, the standard setup for training conditions is either fixed at a single initial condition or uniformly sampled from the admissible state space. This setup often overlooks the challenging but safety-critical corner cases. To bridge this gap, this…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: under review

  46. arXiv:2412.12079  [pdf, other]

    cs.CV

    UniLoc: Towards Universal Place Recognition Using Any Single Modality

    Authors: Yan Xia, Zhendong Li, Yun-Jin Li, Letian Shi, Hu Cao, João F. Henriques, Daniel Cremers

    Abstract: To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. It also promises to reduce computation requirements by having a unified model, and achieving greater sample efficiency by sharing parameters. In this work, we develop…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: 14 pages, 10 figures

  47. arXiv:2412.07991  [pdf]

    q-bio.QM cs.LG

    dsLassoCov: a federated machine learning approach incorporating covariate control

    Authors: Han Cao, Augusto Anguita, Charline Warembourg, Xavier Escriba-Montagut, Martine Vrijheid, Juan R. Gonzalez, Tim Cadman, Verena Schneider-Lindner, Daniel Durstewitz, Xavier Basagana, Emanuel Schwarz

    Abstract: Machine learning has been widely adopted in biomedical research, fueled by the increasing availability of data. However, integrating datasets across institutions is challenging due to legal restrictions and data governance complexities. Federated learning allows the direct, privacy preserving training of machine learning models using geographically distributed datasets, but faces the challenge of…

    Submitted 10 December, 2024; originally announced December 2024.

    Comments: 17 pages, 5 figures

  48. arXiv:2412.06488  [pdf, other]

    cs.RO cs.CV

    Enhancing Scene Coordinate Regression with Efficient Keypoint Detection and Sequential Information

    Authors: Kuan Xu, Zeyu Jiang, Haozhi Cao, Shenghai Yuan, Chen Wang, Lihua Xie

    Abstract: Scene Coordinate Regression (SCR) is a visual localization technique that utilizes deep neural networks (DNN) to directly regress 2D-3D correspondences for camera pose estimation. However, current SCR methods often face challenges in handling repetitive textures and meaningless areas due to their reliance on implicit triangulation. In this paper, we propose an efficient and accurate SCR system. Co…

    Submitted 13 May, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

    Comments: 8 pages, 6 figures

  49. arXiv:2412.03096  [pdf, other]

    cs.CL

    TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling Capability of LLM

    Authors: Huiying Cao, Yiqun Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang

    Abstract: Empathetic conversation is a crucial characteristic in daily conversations between individuals. Nowadays, Large Language models (LLMs) have shown outstanding performance in generating empathetic responses. Knowledge bases like COMET can assist LLMs in mitigating illusions and enhancing the understanding of users' intentions and emotions. However, models remain heavily reliant on fixed knowledge ba…

    Submitted 8 December, 2024; v1 submitted 4 December, 2024; originally announced December 2024.

  50. arXiv:2411.19951  [pdf, other]

    cs.CV cs.CL cs.LG

    Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

    Authors: Shukang Yin, Chaoyou Fu, Sirui Zhao, Yunhang Shen, Chunjiang Ge, Yan Yang, Zuwei Long, Yuhan Dai, Yongdong Luo, Haoyu Cao, Tong Xu, Xing Sun, Caifeng Shan, Ran He, Enhong Chen

    Abstract: Recent years have witnessed the success of Multimodal Large Language Models (MLLMs) in the vision understanding domain. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has mainly been powered by automatic data pipelines, which center around the self-i…

    Submitted 17 March, 2025; v1 submitted 29 November, 2024; originally announced November 2024.

    Comments: Project page: https://github.com/VITA-MLLM/Sparrow