Showing 1–50 of 213 results for author: Duan, H

Searching in archive cs.
  1. arXiv:2505.01888  [pdf, ps, other]

    cs.CV

    Rethinking Score Distilling Sampling for 3D Editing and Generation

    Authors: Xingyu Miao, Haoran Duan, Yang Long, Jungong Han

    Abstract: Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often cannot generate new 3D assets effectively. In this work, we observe that the processes…

    Submitted 3 May, 2025; originally announced May 2025.

  2. arXiv:2505.00063  [pdf, other]

    cs.CL cs.CV

    GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

    Authors: Siqi Li, Yufan Shen, Xiangnan Chen, Jiayi Chen, Hengwei Ju, Haodong Duan, Song Mao, Hongbin Zhou, Bo Zhang, Pinlong Cai, Licheng Wen, Botian Shi, Yong Liu, Xinyu Cai, Yu Qiao

    Abstract: The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic im…

    Submitted 30 April, 2025; originally announced May 2025.

  3. arXiv:2504.21308  [pdf, other]

    cs.CV

    AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images

    Authors: Yunhao Li, Sijing Wu, Wei Sun, Zhichao Zhang, Yucheng Zhu, Zicheng Zhang, Huiyu Duan, Xiongkuo Min, Guangtao Zhai

    Abstract: The rapid development of text-to-image (T2I) generation approaches has attracted extensive interest in evaluating the quality of generated images, leading to the development of various quality assessment methods for general-purpose T2I outputs. However, existing image quality assessment (IQA) methods are limited to providing global quality scores, failing to deliver fine-grained perceptual evaluat…

    Submitted 30 April, 2025; originally announced April 2025.

  4. arXiv:2504.20466  [pdf, other]

    cs.CV

    LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs

    Authors: Woo Yi Yang, Jiarui Wang, Sijing Wu, Huiyu Duan, Yuxin Zhu, Liu Yang, Kang Fu, Guangtao Zhai, Xiongkuo Min

    Abstract: The rapid advancement in generative artificial intelligence has enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate percept…

    Submitted 5 May, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

  5. arXiv:2504.15552  [pdf]

    cs.AI

    A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models

    Authors: Gengxian Cao, Fengyuan Li, Hong Duan, Ye Yang, Bofeng Wang, Donghe Li

    Abstract: This paper introduces a novel multi-agent framework that automates the end-to-end production of Qinqiang opera by integrating Large Language Models, visual generation, and Text-to-Speech synthesis. Three specialized agents collaborate in sequence: Agent1 uses an LLM to craft coherent, culturally grounded scripts; Agent2 employs visual generation models to render contextually accurate stage scenes;…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 17 pages, 7 figures, 1 table

  6. arXiv:2504.11379  [pdf, other]

    cs.CV

    Omni$^2$: Unifying Omnidirectional Image Generation and Editing in an Omni Model

    Authors: Liu Yang, Huiyu Duan, Yucheng Zhu, Xiaohong Liu, Lu Liu, Zitong Xu, Guangji Ma, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet

    Abstract: $360^{\circ}$…

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: 10 pages

  7. arXiv:2504.11368  [pdf, other]

    cs.CV

    From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation

    Authors: Jingkun Chen, Haoran Duan, Xiao Zhang, Boyan Gao, Tao Tan, Vicente Grau, Jungong Han

    Abstract: Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision require…

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: 10 pages, 5 figures

    MSC Class: 68T45 ACM Class: I.2.10; I.4.8

  8. arXiv:2504.09255  [pdf, other]

    cs.CV

    FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment

    Authors: Sijing Wu, Yunhao Li, Ziwen Xu, Yixuan Gao, Huiyu Duan, Wei Sun, Guangtao Zhai

    Abstract: Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and the human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA…

    Submitted 12 April, 2025; originally announced April 2025.

  9. arXiv:2504.08358  [pdf, other]

    cs.CV

    LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

    Authors: Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, Xiongkuo Min

    Abstract: Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. T…

    Submitted 11 April, 2025; originally announced April 2025.

  10. arXiv:2504.07957  [pdf, other]

    cs.CV

    MM-IFEngine: Towards Multimodal Instruction Following

    Authors: Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang

    Abstract: The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To addre…

    Submitted 27 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  11. arXiv:2504.07439  [pdf, other]

    cs.IR cs.CL

    LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking

    Authors: Qi Liu, Haozhe Duan, Yiqun Chen, Quanfeng Lu, Weiwei Sun, Jiaxin Mao

    Abstract: Utilizing large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, and many studies are dedicated to improving the performance and efficiency of using LLMs for reranking. Besides, it can also be applied in many real-world applications, such as search engines or retrieval-augmented generation. In response to the growing demand for research…

    Submitted 10 April, 2025; originally announced April 2025.

  12. arXiv:2504.03738  [pdf, other]

    cs.LG cs.AI cs.CV

    Attention in Diffusion Model: A Survey

    Authors: Litao Hua, Fan Liu, Jie Su, Xingyu Miao, Zizhou Ouyang, Zeyu Wang, Runze Hu, Zhenyu Wen, Bing Zhai, Yang Long, Haoran Duan, Yuan Zhou

    Abstract: Attention mechanisms have become a foundational component in diffusion models, significantly influencing their capacity across a wide range of generative and discriminative tasks. This paper presents a comprehensive survey of attention within diffusion models, systematically analysing its roles, design patterns, and operations across different modalities and tasks. We propose a unified taxonomy th…

    Submitted 1 April, 2025; originally announced April 2025.

  13. arXiv:2504.02826  [pdf, other]

    cs.CV

    Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

    Authors: Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan

    Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing…

    Submitted 8 April, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

    Comments: 27 pages, 23 figures, 1 table. Technical Report

  14. arXiv:2504.02316  [pdf, other]

    cs.CV cs.AI

    ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

    Authors: Yuan Zhou, Shilong Jin, Litao Hua, Wanjun Lv, Haoran Duan, Jungong Han

    Abstract: Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: 13 pages, 11 figures, 3 tables

  15. arXiv:2504.00983  [pdf, other]

    cs.GR cs.AI cs.CV

    WorldScore: A Unified Evaluation Benchmark for World Generation

    Authors: Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, Jiajun Wu

    Abstract: We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Project website: https://haoyi-duan.github.io/WorldScore/ The first two authors contributed equally

  16. arXiv:2503.19990  [pdf, other]

    cs.AI

    LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

    Authors: Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen

    Abstract: Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce \textbf{L…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: 12 pages, 7 figures

  17. arXiv:2503.19757  [pdf, other]

    cs.RO cs.CV

    Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

    Authors: Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, Yuntao Chen

    Abstract: While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuou…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Preprint; https://robodita.github.io

  18. arXiv:2503.17530  [pdf, other]

    cs.CV

    FMDConv: Fast Multi-Attention Dynamic Convolution via Speed-Accuracy Trade-off

    Authors: Tianyu Zhang, Fan Wan, Haoran Duan, Kevin W. Tong, Jingjing Deng, Yang Long

    Abstract: Spatial convolution is fundamental in constructing deep Convolutional Neural Networks (CNNs) for visual recognition. While dynamic convolution enhances model accuracy by adaptively combining static kernels, it incurs significant computational overhead, limiting its deployment in resource-constrained environments such as federated edge computing. To address this, we propose Fast Multi-Attention Dyn…

    Submitted 21 March, 2025; originally announced March 2025.

  19. D2Fusion: Dual-domain Fusion with Feature Superposition for Deepfake Detection

    Authors: Xueqi Qiu, Xingyu Miao, Fan Wan, Haoran Duan, Tejal Shah, Varun Ojha, Yang Long, Rajiv Ranjan

    Abstract: Deepfake detection is crucial for curbing the harm it causes to society. However, current Deepfake detection methods fail to thoroughly explore artifact information across different domains due to insufficient intrinsic interactions. These interactions refer to the fusion and coordination after feature extraction processes across different domains, which are crucial for recognizing complex forgery…

    Submitted 21 March, 2025; originally announced March 2025.

  20. arXiv:2503.15390  [pdf, other]

    eess.IV cs.CV

    FedSCA: Federated Tuning with Similarity-guided Collaborative Aggregation for Heterogeneous Medical Image Segmentation

    Authors: Yumin Zhang, Yan Gao, Haoran Duan, Hanqing Guo, Tejal Shah, Rajiv Ranjan, Bo Wei

    Abstract: Transformer-based foundation models (FMs) have recently demonstrated remarkable performance in medical image segmentation. However, scaling these models is challenging due to the limited size of medical image datasets within isolated hospitals, where data centralization is restricted due to privacy concerns. These constraints, combined with the data-intensive nature of FMs, hinder their broader ap…

    Submitted 19 March, 2025; originally announced March 2025.

  21. arXiv:2503.14478  [pdf, other]

    cs.CV

    Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

    Authors: Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin

    Abstract: Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a m…

    Submitted 19 March, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

    Comments: Evaluation code and dataset: https://github.com/open-compass/Creation-MMBench

  22. arXiv:2503.13178  [pdf, other]

    cs.AI cs.LG

    Rapfi: Distilling Efficient Neural Network for the Game of Gomoku

    Authors: Zhanggen Jin, Haobin Duan, Zhiyang Hang

    Abstract: Games have played a pivotal role in advancing artificial intelligence, with AI agents using sophisticated techniques to compete. Despite the success of neural network based game AIs, their performance often requires significant computational resources. In this paper, we present Rapfi, an efficient Gomoku agent that outperforms CNN-based agents in limited computation environments. Rapfi leverages a…

    Submitted 17 March, 2025; originally announced March 2025.

  23. arXiv:2503.10291  [pdf, other]

    cs.CV cs.CL

    VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

    Authors: Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang

    Abstract: We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when a…

    Submitted 13 March, 2025; originally announced March 2025.

  24. arXiv:2503.10079  [pdf, other]

    cs.CL

    Information Density Principle for MLLM Benchmarks

    Authors: Chunyi Li, Xiaozhe Li, Zicheng Zhang, Yuan Tian, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Jia Wang, Haodong Duan, Kai Chen, Guangtao Zhai

    Abstract: With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Info…

    Submitted 13 March, 2025; originally announced March 2025.

  25. arXiv:2503.10078  [pdf, other]

    cs.CV cs.MM eess.IV

    Image Quality Assessment: From Human to Machine Preference

    Authors: Chunyi Li, Yuan Tian, Xiaoyue Ling, Zicheng Zhang, Haodong Duan, Haoning Wu, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Guo Lu, Weisi Lin, Guangtao Zhai

    Abstract: Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering…

    Submitted 13 March, 2025; originally announced March 2025.

  26. arXiv:2503.03399  [pdf, other]

    cs.LG

    Predicting Practically? Domain Generalization for Predictive Analytics in Real-world Environments

    Authors: Hanyu Duan, Yi Yang, Ahmed Abbasi, Kar Yan Tam

    Abstract: Predictive machine learning models are widely used in customer relationship management (CRM) to forecast customer behaviors and support decision-making. However, the dynamic nature of customer behaviors often results in significant distribution shifts between training data and serving data, leading to performance degradation in predictive models. Domain generalization, which aims to train models t…

    Submitted 5 March, 2025; originally announced March 2025.

  27. arXiv:2503.01785  [pdf, other]

    cs.CV

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

    Abstract: Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models,…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: project page: https://github.com/Liuziyu77/Visual-RFT

  28. arXiv:2503.00407  [pdf]

    cs.LG cs.DC

    Asynchronous Personalized Federated Learning through Global Memorization

    Authors: Fan Wan, Yuchen Li, Xueqi Qiu, Rui Sun, Leyuan Zhang, Xingyu Miao, Tianyu Zhang, Haoran Duan, Yang Long

    Abstract: The proliferation of Internet of Things devices and advances in communication technology have unleashed an explosion of personal data, amplifying privacy concerns amid stringent regulations like GDPR and CCPA. Federated Learning offers a privacy-preserving solution by enabling collaborative model training across decentralized devices without centralizing sensitive data. However, statistical hetero…

    Submitted 1 March, 2025; originally announced March 2025.

  29. arXiv:2502.18411  [pdf, other]

    cs.CV

    OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

    Authors: Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen

    Abstract: Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with…

    Submitted 28 February, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

  30. arXiv:2502.16915  [pdf, other]

    cs.CV

    Multi-Dimensional Quality Assessment for Text-to-3D Assets: Dataset and Model

    Authors: Kang Fu, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min, Jia Wang, Guangtao Zhai

    Abstract: Recent advancements in text-to-image (T2I) generation have spurred the development of text-to-3D asset (T23DA) generation, leveraging pretrained 2D text-to-image diffusion models for text-to-3D asset synthesis. Despite the growing popularity of text-to-3D asset generation, its evaluation has not been well considered and studied. However, given the significant quality discrepancies among various te…

    Submitted 24 February, 2025; originally announced February 2025.

  31. Adaptive Multi-Objective Bayesian Optimization for Capacity Planning of Hybrid Heat Sources in Electric-Heat Coupling Systems of Cold Regions

    Authors: Ruizhe Yang, Zhongkai Yi, Ying Xu, Guiyu Chen, Haojie Yang, Rong Yi, Tongqing Li, Miaozhe Shen, Jin Li, Haoxiang Gao, Hongyu Duan

    Abstract: The traditional heat-load generation pattern of combined heat and power generators has become a problem leading to renewable energy source (RES) power curtailment in cold regions, motivating the proposal of a planning model for alternative heat sources. The model aims to identify non-dominant capacity allocation schemes for heat pumps, thermal energy storage, electric boilers, and combined storage…

    Submitted 13 February, 2025; originally announced February 2025.

    Comments: 11 pages, 11 figures

    Journal ref: IEEE Transactions on Industry Applications 2025 (Early Access)

  32. arXiv:2502.07979  [pdf, other]

    cs.CV

    Joint Modelling Histology and Molecular Markers for Cancer Classification

    Authors: Xiaofei Wang, Hanyu Liu, Yupei Zhang, Boyang Zhao, Hao Duan, Wanming Hu, Yonggao Mou, Stephen Price, Chao Li

    Abstract: Cancers are characterized by remarkable heterogeneity and diverse prognosis. Accurate cancer classification is essential for patient stratification and clinical decision-making. Although digital pathology has been advancing cancer diagnosis and prognosis, the paradigm in cancer pathology has shifted from purely relying on histology features to incorporating molecular markers. There is an urgent ne…

    Submitted 11 February, 2025; originally announced February 2025.

    Comments: accepted by Medical Image Analysis

  33. arXiv:2502.05173  [pdf, other]

    cs.CV

    VideoRoPE: What Makes for Good Video Rotary Position Embedding?

    Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin

    Abstract: While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully co…

    Submitted 27 April, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

  34. Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields

    Authors: Xingyu Miao, Haoran Duan, Yang Bai, Tejal Shah, Jun Song, Yang Long, Rajiv Ranjan, Ling Shao

    Abstract: In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segme…

    Submitted 31 January, 2025; originally announced January 2025.

    Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

  35. arXiv:2501.13953  [pdf, other]

    cs.CL cs.AI

    Redundancy Principles for MLLMs Benchmarks

    Authors: Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai

    Abstract: With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for…

    Submitted 20 January, 2025; originally announced January 2025.

  36. arXiv:2501.12368  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

    Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang

    Abstract: Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary…

    Submitted 21 January, 2025; originally announced January 2025.

    Comments: Tech Report

  37. arXiv:2501.12273  [pdf, other]

    cs.CL cs.AI

    Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

    Authors: Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen

    Abstract: The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage syn…

    Submitted 21 January, 2025; originally announced January 2025.

    Comments: Tech Report. Github: https://github.com/InternLM/Condor

  38. arXiv:2501.12210  [pdf, other]

    cs.CR

    You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense

    Authors: Wuyuao Mai, Geng Hong, Pei Chen, Xudong Pan, Baojun Liu, Yuan Zhang, Haixin Duan, Min Yang

    Abstract: With the rise of generative large language models (LLMs) like LLaMA and ChatGPT, these models have significantly transformed daily life and work by providing advanced insights. However, as jailbreak attacks continue to circumvent built-in safety mechanisms, exploiting carefully crafted scenarios or tokens, the safety risks of LLMs have come into focus. While numerous defense strategies--such as pr…

    Submitted 21 January, 2025; originally announced January 2025.

  39. arXiv:2501.06469  [pdf, other]

    cs.CV

    SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors

    Authors: Zhen Hong, Bowen Wang, Haoran Duan, Yawen Huang, Xiong Li, Zhenyu Wen, Xiang Wu, Wei Xiang, Yefeng Zheng

    Abstract: Neural implicit representations have recently shown promising progress in dense Simultaneous Localization And Mapping (SLAM). However, existing works have shortcomings in terms of reconstruction quality and real-time performance, mainly due to an inflexible scene representation strategy that does not leverage any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system tha…

    Submitted 11 January, 2025; originally announced January 2025.

  40. arXiv:2501.05510  [pdf, other]

    cs.CV cs.AI

    OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

    Authors: Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang

    Abstract: Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite…

    Submitted 27 March, 2025; v1 submitted 9 January, 2025; originally announced January 2025.

    Comments: CVPR 2025

  41. arXiv:2501.03226  [pdf, other]

    cs.CL cs.AI cs.LG

    BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

    Authors: Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang

    Abstract: Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematic…

    Submitted 17 February, 2025; v1 submitted 6 January, 2025; originally announced January 2025.

    Comments: Codes and Data are available at https://github.com/beichenzbc/BoostStep

  42. arXiv:2501.02509   

    cs.CV

    Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method

    Authors: Hui Li, Xiaoyu Ren, Hongjiu Yu, Huiyu Duan, Kai Li, Ying Chen, Libo Wang, Xiongkuo Min, Guangtao Zhai, Xu Liu

    Abstract: Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live streaming for facial retouching, content recommendation, etc. However, previous FAP datasets are either small, closed-source, or lack diversity. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, in t…

    Submitted 12 March, 2025; v1 submitted 5 January, 2025; originally announced January 2025.

    Comments: Section 3 in Images Collection has description errors about data cleaning. The compared methods data of Table 3 lacks other metrics

  43. arXiv:2501.01116  [pdf, other]

    cs.CV cs.MM

    HarmonyIQA: Pioneering Benchmark and Model for Image Harmonization Quality Assessment

    Authors: Zitong Xu, Huiyu Duan, Guangji Ma, Liu Yang, Jiarui Wang, Qingbo Wu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet

    Abstract: Image composition involves extracting a foreground object from one image and pasting it into another image through Image harmonization algorithms (IHAs), which aim to adjust the appearance of the foreground object to better match the background. Existing image quality assessment (IQA) methods may fail to align with human visual preference on image harmonization due to the insensitivity to minor co…

    Submitted 2 January, 2025; originally announced January 2025.

  44. arXiv:2412.20423  [pdf, other]

    cs.CV cs.MM

    ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos

    Authors: Xilei Zhu, Huiyu Duan, Liu Yang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet

    Abstract: With the rapid development of eXtended Reality (XR), egocentric spatial shooting and display technologies have further enhanced immersion and engagement for users. Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensure a high-quality viewing experience. However, the corresponding research is still lacking. In this paper, we use the embodied experience to highli…

    Submitted 29 December, 2024; originally announced December 2024.

    Comments: 7 pages, 3 figures

  45. arXiv:2412.19238  [pdf, other]

    cs.CV cs.LG cs.MM eess.IV

    FineVQ: Fine-Grained User Generated Content Video Quality Assessment

    Authors: Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, Guangtao Zhai

    Abstract: The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To a…

    Submitted 26 April, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

  46. arXiv:2412.18926  [pdf, other]

    cs.LG cs.AI

    Exemplar-condensed Federated Class-incremental Learning

    Authors: Rui Sun, Yumin Zhang, Varun Ojha, Tejal Shah, Haoran Duan, Bo Wei, Rajiv Ranjan

    Abstract: We propose Exemplar-Condensed federated class-incremental learning (ECoral) to distil the training characteristics of real images from streaming data into informative rehearsal exemplars. The proposed method eliminates the limitations of exemplar selection in replay-based approaches for mitigating catastrophic forgetting in federated continual learning (FCL). The limitations particularly related t…

    Submitted 25 December, 2024; originally announced December 2024.

  47. arXiv:2412.18267  [pdf, other]

    cs.LG

    NoiseHGNN: Synthesized Similarity Graph-Based Neural Network For Noised Heterogeneous Graph Representation Learning

    Authors: Xiong Zhang, Cheng Xie, Haoran Duan, Beibei Yu

    Abstract: Real-world graph data environments intrinsically exist noise (e.g., link and structure errors) that inevitably disturb the effectiveness of graph representation and downstream learning tasks. For homogeneous graphs, the latest works use original node features to synthesize a similarity graph that can correct the structure of the noised graph. This idea is based on the homogeneity assumption, which…

    Submitted 24 December, 2024; originally announced December 2024.

    Comments: AAAI2025

  48. arXiv:2412.17038  [pdf, other]

    cs.CV cs.AI

    ErasableMask: A Robust and Erasable Privacy Protection Scheme against Black-box Face Recognition Models

    Authors: Sipeng Shen, Yunming Zhang, Dengpan Ye, Xiuwen Shi, Long Tang, Haoran Duan, Jiacheng Deng, Ziyi Liu

    Abstract: While face recognition (FR) models have brought remarkable convenience in face verification and identification, they also pose substantial privacy risks to the public. Existing facial privacy protection schemes usually adopt adversarial examples to disrupt face verification of FR models. However, these schemes often suffer from weak transferability against black-box FR models and permanently damag…

    Submitted 29 December, 2024; v1 submitted 22 December, 2024; originally announced December 2024.

  49. Revealing the Black Box of Device Search Engine: Scanning Assets, Strategies, and Ethical Consideration

    Authors: Mengying Wu, Geng Hong, Jinsong Chen, Qi Liu, Shujun Tang, Youhao Li, Baojun Liu, Haixin Duan, Min Yang

    Abstract: In the digital age, device search engines such as Censys and Shodan play crucial roles by scanning the internet to catalog online devices, aiding in the understanding and mitigation of network security risks. While previous research has used these tools to detect devices and assess vulnerabilities, there remains uncertainty regarding the assets they scan, the strategies they employ, and whether th…

    Submitted 20 December, 2024; originally announced December 2024.

    Comments: 18 pages, accepted by NDSS 2025

  50. arXiv:2412.13743  [pdf, other]

    cs.MM cs.HC

    User-Generated Content and Editors in Games: A Comprehensive Survey

    Authors: Yuyue Liu, Haihan Duan, Wei Cai

    Abstract: User-Generated Content (UGC) refers to any form of content, such as posts and images, created by users rather than by professionals. In recent years, UGC has become an essential part of the evolving video game industry, influencing both game culture and community dynamics. The ability for users to actively contribute to the games they engage with has shifted the landscape of gaming from a one-dire…

    Submitted 18 December, 2024; originally announced December 2024.