Showing 1–50 of 133 results for author: Jing, L

Searching in archive cs.
  1. arXiv:2507.05056  [pdf, ps, other]

    cs.CV cs.AI

    INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

    Authors: Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng

    Abstract: Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data sam…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2507.02503  [pdf, ps, other]

    cs.LG cs.AI cs.CE

    Continual Gradient Low-Rank Projection Fine-Tuning for LLMs

    Authors: Chenxu Wang, Yilin Lyu, Zicheng Sun, Liping Jing

    Abstract: Continual fine-tuning of Large Language Models (LLMs) is hampered by the trade-off between efficiency and expressiveness. Low-Rank Adaptation (LoRA) offers efficiency but constrains the model's ability to learn new tasks and transfer knowledge due to its low-rank nature and reliance on explicit parameter constraints. We propose GORP (Gradient LOw Rank Projection) for Continual Learning, a novel tr…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 15 pages, 6 figures, accepted by ACL 2025 main

  3. Visual hallucination detection in large vision-language models via evidential conflict

    Authors: Tao Huang, Zhekun Liu, Rui Wang, Yang Zhang, Liping Jing

    Abstract: Despite the remarkable multimodal capabilities of Large Vision-Language Models (LVLMs), discrepancies often occur between visual inputs and textual outputs--a phenomenon we term visual hallucination. This critical reliability gap poses substantial risks in safety-critical Artificial Intelligence (AI) applications, necessitating a comprehensive evaluation benchmark and effective detection methods.…

    Submitted 24 June, 2025; originally announced June 2025.

    Journal ref: International Journal of Approximate Reasoning, Volume 186, November 2025, Article 109507

  4. arXiv:2506.17335  [pdf, ps, other]

    cs.SE cs.AI

    LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research

    Authors: Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, Xinya Du

    Abstract: Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code…

    Submitted 19 June, 2025; originally announced June 2025.

  5. arXiv:2506.07323  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Speech Recognition on TV Series with Video-guided Post-Correction

    Authors: Haoyuan Yang, Yue Zhang, Liqiang Jing

    Abstract: Automatic Speech Recognition (ASR) has achieved remarkable success with deep learning, driving advancements in conversational artificial intelligence, media transcription, and assistive technologies. However, ASR systems still struggle in complex environments such as TV series, where overlapping speech, domain-specific terminology, and long-range contextual dependencies pose significant challenges…

    Submitted 8 June, 2025; originally announced June 2025.

  6. arXiv:2505.23830  [pdf, ps, other]

    cs.CL

    EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models

    Authors: Linglin Jing, Yuting Gao, Zhigang Wang, Wang Lan, Yiwen Tang, Wenhai Wang, Kaipeng Zhang, Qingpei Guo

    Abstract: Recent advancements have shown that the Mixture of Experts (MoE) approach significantly enhances the capacity of large language models (LLMs) and improves performance on downstream tasks. Building on these promising results, multi-modal large language models (MLLMs) have increasingly adopted MoE techniques. However, existing multi-modal MoE tuning methods typically face two key challenges: expert…

    Submitted 28 May, 2025; originally announced May 2025.

  7. arXiv:2505.03790  [pdf, other]

    cs.LG cs.AI eess.SP

    A Time-Series Data Augmentation Model through Diffusion and Transformer Integration

    Authors: Yuren Zhang, Zhongnan Pu, Lei Jing

    Abstract: With the development of Artificial Intelligence, numerous real-world tasks have been accomplished using technology integrated with deep learning. To achieve optimal performance, deep neural networks typically require large volumes of data for training. Although advances in data augmentation have facilitated the acquisition of vast datasets, most of this data is concentrated in domains like images…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: 10 pages, 22 figures

  8. arXiv:2505.01958  [pdf, ps, other]

    cs.CV cs.CL

    A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models

    Authors: Liqiang Jing, Guiming Hardy Chen, Ehsan Aghazadeh, Xin Eric Wang, Xinya Du

    Abstract: Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks, but visual object hallucination remains a persistent issue. It refers to scenarios where models generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability. Previous works focus on the evaluation and mitiga…

    Submitted 3 May, 2025; originally announced May 2025.

  9. arXiv:2505.00755  [pdf, other]

    cs.CV cs.AI

    P2P-Insole: Human Pose Estimation Using Foot Pressure Distribution and Motion Sensors

    Authors: Atsuya Watanabe, Ratna Aisuwarya, Lei Jing

    Abstract: This work presents P2P-Insole, a low-cost approach for estimating and visualizing 3D human skeletal data using insole-type sensors integrated with IMUs. Each insole, fabricated with e-textile garment techniques, costs under USD 1, making it significantly cheaper than commercial alternatives and ideal for large-scale production. Our approach uses foot pressure distribution, acceleration, and rotati…

    Submitted 1 May, 2025; originally announced May 2025.

  10. arXiv:2504.02876  [pdf, other]

    cs.CV cs.LG

    Multimodal Reference Visual Grounding

    Authors: Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang

    Abstract: Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Die…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Project page with our code and dataset: https://irvlutd.github.io/MultiGrounding

  11. arXiv:2503.18377  [pdf, other]

    cs.LG cs.AI

    Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs

    Authors: Chang Gao, Kang Zhao, Jianfei Chen, Liping Jing

    Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocating the sparsity for each layer. Recent sparsity allocation methods are often based on heuristics o…

    Submitted 24 March, 2025; originally announced March 2025.

  12. arXiv:2503.14674  [pdf, ps, other]

    cs.CV

    Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs

    Authors: Liu Jing, Amirul Rahman

    Abstract: Large Vision-Language Models (LVLMs) have shown remarkable progress in various multimodal tasks, yet they often struggle with complex visual reasoning that requires multi-step inference. To address this limitation, we propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering…

    Submitted 18 March, 2025; originally announced March 2025.

  13. arXiv:2503.12800  [pdf, other]

    cs.CV

    Pairwise Similarity Regularization for Semi-supervised Graph Medical Image Segmentation

    Authors: Jialu Zhou, Dianxi Shi, Shaowu Yang, Chunping Qiu, Luoxi Jing, Mengzhu Wang

    Abstract: By fully leveraging the value of unlabeled data, semi-supervised medical image segmentation algorithms significantly reduce the limitation of limited labeled data, achieving a significant improvement in accuracy. However, the distributional shift between labeled and unlabeled data weakens the utilization of information from the labeled data. To alleviate the problem, we propose a graph network…

    Submitted 17 March, 2025; originally announced March 2025.

  14. arXiv:2502.06877  [pdf, other]

    cs.LG

    WirelessGPT: A Generative Pre-trained Multi-task Learning Framework for Wireless Communication

    Authors: Tingting Yang, Ping Zhang, Mengfan Zheng, Yuxuan Shi, Liwen Jing, Jianbo Huang, Nan Li

    Abstract: This paper introduces WirelessGPT, a pioneering foundation model specifically designed for multi-task learning in wireless communication and sensing. Specifically, WirelessGPT leverages large-scale wireless channel datasets for unsupervised pretraining and extracting universal channel representations, which captures complex spatiotemporal dependencies. In fact, this task-agnostic design adapts Wire…

    Submitted 8 February, 2025; originally announced February 2025.

    Comments: 8 pages, 4 figures

  15. arXiv:2501.02811  [pdf, other]

    cs.CV

    First-place Solution for Streetscape Shop Sign Recognition Competition

    Authors: Bin Wang, Li Jing

    Abstract: Text recognition technology applied to street-view storefront signs is increasingly utilized across various practical domains, including map navigation, smart city planning analysis, and business value assessments in commercial districts. This technology holds significant research and commercial potential. Nevertheless, it faces numerous challenges. Street view images often contain signboards with…

    Submitted 22 April, 2025; v1 submitted 6 January, 2025; originally announced January 2025.

    Comments: technical report

  16. arXiv:2412.18091   

    cs.AI

    AutoSculpt: A Pattern-based Model Auto-pruning Framework Using Reinforcement Learning and Graph Learning

    Authors: Lixian Jing, Jianpeng Qi, Junyu Dong, Yanwei Yu

    Abstract: As deep neural networks (DNNs) are increasingly deployed on edge devices, optimizing models for constrained computational resources is critical. Existing auto-pruning methods face challenges due to the diversity of DNN models, various operators (e.g., filters), and the difficulty in balancing pruning granularity with model accuracy. To address these limitations, we introduce AutoSculpt, a pattern-…

    Submitted 19 June, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

    Comments: I have identified a significant and fundamental flaw in the methodology described in Section 3 of the manuscript. This flaw pertains to a critical error in the implementation of the model's training procedure, which renders the reported performance metrics unreliable. This issue is not correctable through an erratum or replacement as it undermines the core findings and validity of the entire study

  17. arXiv:2412.16232  [pdf, other]

    cs.CV cs.AI cs.LG

    Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization

    Authors: Yue Zhang, Liqiang Jing, Vibhav Gogate

    Abstract: We introduce a new task called Defeasible Visual Entailment (DVE), where the goal is to allow the modification of the entailment relationship between an image premise and a text hypothesis based on an additional update. While this concept is well-established in Natural Language Inference, it remains unexplored in visual entailment. At a high level, DVE enables models to refine their initial interp…

    Submitted 8 February, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  18. arXiv:2412.14626  [pdf, other]

    cs.CL cs.AI

    Learning to Generate Research Idea with Dynamic Control

    Authors: Ruochen Li, Liqiang Jing, Chi Han, Jiawei Zhou, Xinya Du

    Abstract: The rapid advancements in large language models (LLMs) have demonstrated their potential to accelerate scientific discovery, particularly in automating the process of research ideation. LLM-based systems have shown promise in generating hypotheses and research ideas. However, current approaches predominantly rely on prompting-based pre-trained models, limiting their ability to optimize generated c…

    Submitted 19 December, 2024; originally announced December 2024.

  19. arXiv:2412.09870  [pdf, ps, other]

    cs.CV

    Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction

    Authors: Liu Jing, Amirul Rahman

    Abstract: Semantic location prediction from multimodal social media posts is a critical task with applications in personalized services and human mobility analysis. This paper introduces Contextualized Vision-Language Alignment (CoVLA), a discriminative framework designed to address the challenges of contextual ambiguity and modality discrepancy inherent in this task. CoVLA leverages a Contextual A…

    Submitted 13 December, 2024; originally announced December 2024.

  20. arXiv:2411.11016  [pdf, other]

    cs.CV cs.AI

    Time Step Generating: A Universal Synthesized Deepfake Image Detector

    Authors: Ziyue Zeng, Haoyuan Liu, Dingjie Peng, Luoxu Jing, Hiroshi Watanabe

    Abstract: Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, Diffusion Models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. It simultaneously raises serious concerns regarding privacy and security. Some methods are proposed to distinguish the diffusion model…

    Submitted 19 November, 2024; v1 submitted 17 November, 2024; originally announced November 2024.

    Comments: 9 pages, 7 figures

    MSC Class: 62H30; 68T07 ACM Class: I.4.9; I.4.7; I.5.2

  21. arXiv:2410.21276  [pdf, other]

    cs.CL cs.AI cs.CV cs.CY cs.LG cs.SD eess.AS

    GPT-4o System Card

    Authors: OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, et al. (395 additional authors not shown)

    Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 mil…

    Submitted 25 October, 2024; originally announced October 2024.

  22. arXiv:2410.16135  [pdf, other]

    cs.LG cs.AI

    Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

    Authors: Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen

    Abstract: To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often possesses low actual speedups ($\leq 1.3$) and requires fixed sparse ratios, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not incur any speedups on GPUs. Recent studies suggest that V:N:M sparsity is promising in…

    Submitted 2 June, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

  23. arXiv:2410.12158  [pdf, other]

    cs.CV

    SAM-Guided Masked Token Prediction for 3D Scene Understanding

    Authors: Zhimin Chen, Liang Yang, Yingwei Li, Longlong Jing, Bing Li

    Abstract: Foundation models have significantly enhanced 2D task performance, and recent works like Bridge3D have successfully applied these models to improve 3D scene understanding through knowledge distillation, marking considerable advancements. Nonetheless, challenges such as the misalignment between 2D and 3D representations and the persistent long-tail distribution in 3D datasets still restrict the eff…

    Submitted 17 October, 2024; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: Accepted by NeurIPS 2024

  24. arXiv:2410.08500  [pdf, ps, other]

    cs.RO cs.AI

    Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning

    Authors: Yunpeng Gao, Zhigang Wang, Linglin Jing, Dong Wang, Xuelong Li, Bin Zhao

    Abstract: Aerial Vision-and-Language Navigation (VLN) is a novel task enabling Unmanned Aerial Vehicles (UAVs) to navigate in outdoor environments through natural language instructions and visual cues. It remains challenging due to the complex spatial relationships in outdoor aerial scenes. In this paper, we propose an end-to-end zero-shot framework for aerial VLN tasks, where the large language model (LLM)…

    Submitted 3 July, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

  25. arXiv:2409.16494  [pdf, other]

    cs.CV cs.CL

    A Unified Hallucination Mitigation Framework for Large Vision-Language Models

    Authors: Yue Chang, Liqiang Jing, Xiaopeng Zhang, Yue Zhang

    Abstract: Hallucination is a common problem for Large Vision-Language Models (LVLMs) with long generations which is difficult to eradicate. The generation with hallucinations is partially inconsistent with the image content. To mitigate hallucination, current studies either focus on the process of model inference or the results of model generation, but the solutions they design sometimes do not deal appropr…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: Accepted by TMLR

  26. arXiv:2409.13612  [pdf, ps, other]

    cs.CV

    FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs

    Authors: Bowen Yan, Zhengsong Zhang, Liqiang Jing, Eftekhar Hossain, Xinya Du

    Abstract: The rapid development of Large Vision-Language Models (LVLMs) often comes with widespread hallucination issues, making cost-effective and comprehensive assessments increasingly vital. Current approaches mainly rely on costly annotations and are not comprehensive -- in terms of evaluating all aspects such as relations, attributes, and dependencies between aspects. Therefore, we introduce the FIHA (…

    Submitted 2 June, 2025; v1 submitted 20 September, 2024; originally announced September 2024.

    Comments: Accepted by Findings of ACL 2025

  27. arXiv:2409.07703  [pdf, other]

    cs.AI cs.CL

    DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?

    Authors: Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu

    Abstract: Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing da…

    Submitted 11 April, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

  28. arXiv:2408.14267  [pdf, other]

    cs.LG cs.CV

    1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

    Authors: Chang Gao, Jianfei Chen, Kang Zhao, Jiaqi Wang, Liping Jing

    Abstract: Fully quantized training (FQT) accelerates the training of deep neural networks by quantizing the activations, weights, and gradients into lower precision. To explore the ultimate limit of FQT (the lowest achievable precision), we make a first attempt at 1-bit FQT. We provide a theoretical analysis of FQT based on Adam and SGD, revealing that the gradient variance influences the convergence of FQT…

    Submitted 26 August, 2024; originally announced August 2024.

  29. arXiv:2408.12312  [pdf, other]

    cs.CV

    MakeupAttack: Feature Space Black-box Backdoor Attack on Face Recognition via Makeup Transfer

    Authors: Ming Sun, Lihua Jing, Zixuan Zhu, Rui Wang

    Abstract: Backdoor attacks pose a significant threat to the training process of deep neural networks (DNNs). As a widely-used DNN-based application in real-world scenarios, face recognition systems, once implanted with a backdoor, may cause serious consequences. Backdoor research on face recognition is still in its early stages, and the existing backdoor triggers are relatively simple and visible. Furtherm…

    Submitted 22 August, 2024; originally announced August 2024.

  30. arXiv:2407.08836  [pdf, ps, other]

    cs.CL cs.AI

    Fault Diagnosis in Power Grids with Large Language Model

    Authors: Liu Jing, Amirul Rahman

    Abstract: Power grid fault diagnosis is a critical task for ensuring the reliability and stability of electrical infrastructure. Traditional diagnostic systems often struggle with the complexity and variability of power grid data. This paper proposes a novel approach that leverages Large Language Models (LLMs), specifically ChatGPT and GPT-4, combined with advanced prompt engineering to enhance fault diagno…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 11 pages

  31. arXiv:2407.03240  [pdf, other]

    cs.CV

    Cyclic Refiner: Object-Aware Temporal Representation Learning for Multi-View 3D Detection and Tracking

    Authors: Mingzhe Guo, Zhipeng Zhang, Liping Jing, Yuan He, Ke Wang, Heng Fan

    Abstract: We propose a unified object-aware temporal learning framework for multi-view 3D detection and tracking tasks. Having observed that the efficacy of the temporal fusion strategy in recent multi-view perception methods may be weakened by distractors and background clutters in historical frames, we propose a cyclic learning mechanism to improve the robustness of multi-view representation learning. The…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted by IJCV

  32. arXiv:2406.17680  [pdf, other]

    cs.CV

    End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation

    Authors: Mingzhe Guo, Zhipeng Zhang, Yuan He, Ke Wang, Liping Jing

    Abstract: We propose UAD, a method for vision-based end-to-end autonomous driving (E2EAD), achieving the best open-loop evaluation performance in nuScenes, meanwhile showing robust closed-loop driving quality in CARLA. Our motivation stems from the observation that current E2EAD models still mimic the modular architecture in typical driving stacks, with carefully designed supervised perception and predictio…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 17 pages, 10 figures and 15 tables

  33. arXiv:2406.15695  [pdf, ps, other]

    cs.CL

    SS-GEN: A Social Story Generation Framework with Large Language Models

    Authors: Yi Feng, Mingyang Song, Jiaqi Wang, Zhuang Chen, Guanqun Bi, Minlie Huang, Liping Jing, Jian Yu

    Abstract: Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines. Social Stories are traditionally crafted by psychology experts under strict constraints to address these challenges but are costly and limited in diversity. As Large Language Models (LLMs) advance, there's an opportunity to develop more automated, affordable, and access…

    Submitted 4 July, 2025; v1 submitted 21 June, 2024; originally announced June 2024.

    Comments: AAAI 2025 (Oral)

  34. arXiv:2405.16571  [pdf, other]

    cs.CL

    A Preliminary Empirical Study on Prompt-based Unsupervised Keyphrase Extraction

    Authors: Mingyang Song, Yi Feng, Liping Jing

    Abstract: Pre-trained large language models can perform natural language processing downstream tasks by conditioning on human-designed prompts. However, a prompt-based approach often requires "prompt engineering" to design different prompts, primarily hand-crafted through laborious trial and error, requiring human intervention and expertise. It is a challenging problem when constructing a prompt-based keyph…

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: work in progress

  35. arXiv:2405.04390  [pdf, other]

    cs.CV

    DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

    Authors: Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, Bin Dai

    Abstract: Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by i…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR2024

  36. arXiv:2405.00236  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    STT: Stateful Tracking with Transformers for Autonomous Driving

    Authors: Longlong Jing, Ruichi Yu, Xu Chen, Zhengli Zhao, Shiwei Sheng, Colin Graber, Qi Chen, Qinru Li, Shangxuan Wu, Han Deng, Sangjin Lee, Chris Sweeney, Qiurui He, Wei-Chih Hung, Tong He, Xingyi Zhou, Farshid Moussavi, Zijian Guo, Yin Zhou, Mingxing Tan, Weilong Yang, Congcong Li

    Abstract: Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must be able to reliably track objects across frames and accurately estimate their states such as velocity and acceleration in the present. Existing works frequently focus on the association task while either neglecting the model performance on state estimation or deploying c…

    Submitted 30 April, 2024; originally announced May 2024.

    Comments: ICRA 2024

  37. arXiv:2404.16452  [pdf, other]

    cs.CV

    PAD: Patch-Agnostic Defense against Adversarial Patch Attacks

    Authors: Lihua Jing, Rui Wang, Wenqi Ren, Xin Dong, Cong Zou

    Abstract: Adversarial patch attacks present a significant threat to real-world object detectors due to their practical feasibility. Existing defense methods, which rely on attack data or prior knowledge, struggle to effectively address a wide range of adversarial patches. In this paper, we show two inherent characteristics of adversarial patches, semantic independence and spatial heterogeneity, independent…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024

  38. The Victim and The Beneficiary: Exploiting a Poisoned Model to Train a Clean Model on Poisoned Data

    Authors: Zixuan Zhu, Rui Wang, Cong Zou, Lihua Jing

    Abstract: Recently, backdoor attacks have posed a serious security threat to the training process of deep neural networks (DNNs). The attacked model behaves normally on benign samples but outputs a specific result when the trigger is present. However, compared with the rocketing progress of backdoor attacks, existing defenses are difficult to deal with these threats effectively or require benign samples to…

    Submitted 31 May, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

    Comments: 13 pages, 6 figures, published to ICCV

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2023: 155-164

  39. arXiv:2404.05046  [pdf, other]

    cs.CV cs.CL

    FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

    Authors: Liqiang Jing, Xinya Du

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between text and image modalities which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to…

    Submitted 6 May, 2025; v1 submitted 7 April, 2024; originally announced April 2024.

  40. arXiv:2403.16788  [pdf, other]

    cs.CV

    HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation

    Authors: Linglin Jing, Yiming Ding, Yunpeng Gao, Zhigang Wang, Xu Yan, Dong Wang, Gerald Schaefer, Hui Fang, Bin Zhao, Xuelong Li

    Abstract: Event-based semantic segmentation has gained popularity due to its capability to deal with scenarios under high-speed motion and extreme lighting conditions, which cannot be addressed by conventional RGB cameras. Since it is hard to annotate event data, previous approaches rely on event-to-image reconstruction to obtain pseudo labels for training. However, this will inevitably introduce noise, and…

    Submitted 25 March, 2024; originally announced March 2024.

  41. arXiv:2403.15715  [pdf, other]

    cs.CL

    EDDA: A Encoder-Decoder Data Augmentation Framework for Zero-Shot Stance Detection

    Authors: Daijun Ding, Li Dong, Zhichao Huang, Guangning Xu, Xu Huang, Bo Liu, Liwen Jing, Bowen Zhang

    Abstract: Stance detection aims to determine the attitude expressed in text towards a given target. Zero-shot stance detection (ZSSD) has emerged to classify stances towards unseen targets during inference. Recent data augmentation techniques for ZSSD increase transferable knowledge between targets through text or target augmentation. However, these methods exhibit limitations. Target augmentation lacks log…

    Submitted 23 March, 2024; originally announced March 2024.

  42. arXiv:2403.02637  [pdf, other]

    cs.CV

    BSDP: Brain-inspired Streaming Dual-level Perturbations for Online Open World Object Detection

    Authors: Yu Chen, Liyan Ma, Liping Jing, Jian Yu

    Abstract: Humans can easily distinguish the known and unknown categories and can recognize the unknown object by learning it once instead of repeating it many times without forgetting the learned object. Hence, we aim to make deep learning models simulate the way people learn. We refer to such a learning manner as OnLine Open World Object Detection (OLOWOD). Existing OWOD approaches pay more attention to the…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: 29 pages, 12 figures

  43. arXiv:2402.18107  [pdf, other]

    cs.MM

    Multimodal Interaction Modeling via Self-Supervised Multi-Task Learning for Review Helpfulness Prediction

    Authors: HongLin Gong, Mengzhao Jia, Liqiang Jing

    Abstract: In line with the latest research, the task of identifying helpful reviews from a vast pool of user-generated textual and visual data has become a prominent area of study. Effective modal representations are expected to possess two key attributes: consistency and differentiation. Current methods designed for Multimodal Review Helpfulness Prediction (MRHP) face limitations in capturing distinctive i…

    Submitted 25 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: 10 pages, 4 figures, 4 tables

  44. arXiv:2402.11414  [pdf, other]

    cs.CL

    Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization

    Authors: Yue Zhang, Jingxuan Zuo, Liqiang Jing

    Abstract: Multimodal summarization aims to generate a concise summary based on the input text and image. However, the existing methods potentially suffer from unfactual output. To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks (FALLACIOUS) for different application scenarios, i.e. reference-based factuality evaluation framework a…

    Submitted 27 December, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

    Comments: AAAI 2025

  45. arXiv:2402.06038  [pdf, other]

    cs.LG cs.AI cs.CV

    Understanding Contrastive Representation Learning from Positive Unlabeled (PU) Data

    Authors: Anish Acharya, Li Jing, Bhargav Bhushanam, Dhruv Choudhary, Michael Rabbat, Sujay Sanghavi, Inderjit S Dhillon

    Abstract: Pretext Invariant Representation Learning (PIRL) followed by Supervised Fine-Tuning (SFT) has become a standard paradigm for learning with limited labels. We extend this approach to the Positive Unlabeled (PU) setting, where only a small set of labeled positives and a large unlabeled pool -- containing both positives and negatives -- are available. We study this problem under two regimes: (i) without…

    Submitted 10 April, 2025; v1 submitted 8 February, 2024; originally announced February 2024.

  46. arXiv:2402.03658  [pdf, other]

    cs.CL cs.MM

    Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue

    Authors: Kun Ouyang, Liqiang Jing, Xuemeng Song, Meng Liu, Yupeng Hu, Liqiang Nie

    Abstract: Sarcasm Explanation in Dialogue (SED) is a new yet challenging task, which aims to generate a natural language explanation for the given sarcastic dialogue that involves multiple modalities (i.e., utterance, video, and audio). Although existing studies have achieved great success based on the generative pretrained language model BART, they overlook exploiting the sentiments residing in the utterance…

    Submitted 6 January, 2025; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: This paper got accepted by IEEE TMM

  47. arXiv:2402.03635  [pdf, ps, other]

    cs.IR

    Retrieval Augmented Cross-Modal Tag Recommendation in Software Q&A Sites

    Authors: Sijin Lu, Pengyu Xu, Bing Liu, Hongjian Sun, Liping Jing, Jian Yu

    Abstract: Posts in software Q&A sites often consist of three main parts: title, description and code, which are interconnected and jointly describe the question. Existing tag recommendation methods often treat different modalities as a whole or inadequately consider the interaction between different modalities. Additionally, they focus on extracting information directly from the post itself, neglecting the…

    Submitted 5 February, 2024; originally announced February 2024.

  48. arXiv:2401.04317  [pdf, other]

    cs.CV cs.CL

    Vision Reimagined: AI-Powered Breakthroughs in WiFi Indoor Imaging

    Authors: Jianyang Shi, Bowen Zhang, Amartansh Dubey, Ross Murch, Liwen Jing

    Abstract: Indoor imaging is a critical task for robotics and internet-of-things. WiFi as an omnipresent signal is a promising candidate for carrying out passive imaging and synchronizing the up-to-date information to all connected devices. This is the first research work to consider WiFi indoor imaging as a multi-modal image generation task that converts the measured WiFi power into a high-resolution indoor…

    Submitted 8 January, 2024; originally announced January 2024.

  49. arXiv:2401.02402  [pdf, other]

    cs.CV

    3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

    Authors: Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng

    Abstract: 3D panoptic segmentation is a challenging perception task, especially in autonomous driving. It aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing these approaches to unseen things and unseen stuff categories remains an open problem. For unseen obj…

    Submitted 2 April, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

  50. arXiv:2401.01761  [pdf, other]

    cs.CL

    Cross-target Stance Detection by Exploiting Target Analytical Perspectives

    Authors: Daijun Ding, Rong Chen, Liwen Jing, Bowen Zhang, Xu Huang, Li Dong, Xiaowen Zhao, Ge Song

    Abstract: Cross-target stance detection (CTSD) is an important task, which infers the attitude of the destination target by utilizing annotated data derived from the source target. One important approach in CTSD is to extract domain-invariant features to bridge the knowledge gap between multiple targets. However, the analysis of informal and short text structure, and implicit expressions, complicate the ext…

    Submitted 3 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.