-
Rethinking Score Distilling Sampling for 3D Editing and Generation
Authors:
Xingyu Miao,
Haoran Duan,
Yang Long,
Jungong Han
Abstract:
Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often can not generate new 3D assets effectively. In this work, we observe that the processes…
▽ More
Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often can not generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing. The code is available on: https://github.com/xingy038/UDS.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
Authors:
Siqi Li,
Yufan Shen,
Xiangnan Chen,
Jiayi Chen,
Hengwei Ju,
Haodong Duan,
Song Mao,
Hongbin Zhou,
Bo Zhang,
Pinlong Cai,
Licheng Wen,
Botian Shi,
Yong Liu,
Xinyu Cai,
Yu Qiao
Abstract:
The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic im…
▽ More
The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 1.9k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate the GDI-Bench on various open-source and closed-source models, conducting decoupled analyses in the visual and reasoning domains. For instance, the GPT-4o model excels in reasoning tasks but exhibits limitations in visual capabilities. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI Model that mitigates the issue of catastrophic forgetting during the supervised fine-tuning (SFT) process through a intelligence-preserving training strategy. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and model will be open source.
△ Less
Submitted 30 April, 2025;
originally announced May 2025.
-
AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images
Authors:
Yunhao Li,
Sijing Wu,
Wei Sun,
Zhichao Zhang,
Yucheng Zhu,
Zicheng Zhang,
Huiyu Duan,
Xiongkuo Min,
Guangtao Zhai
Abstract:
The rapid development of text-to-image (T2I) generation approaches has attracted extensive interest in evaluating the quality of generated images, leading to the development of various quality assessment methods for general-purpose T2I outputs. However, existing image quality assessment (IQA) methods are limited to providing global quality scores, failing to deliver fine-grained perceptual evaluat…
▽ More
The rapid development of text-to-image (T2I) generation approaches has attracted extensive interest in evaluating the quality of generated images, leading to the development of various quality assessment methods for general-purpose T2I outputs. However, existing image quality assessment (IQA) methods are limited to providing global quality scores, failing to deliver fine-grained perceptual evaluations for structurally complex subjects like humans, which is a critical challenge considering the frequent anatomical and textural distortions in AI-generated human images (AGHIs). To address this gap, we introduce AGHI-QA, the first large-scale benchmark specifically designed for quality assessment of AGHIs. The dataset comprises 4,000 images generated from 400 carefully crafted text prompts using 10 state of-the-art T2I models. We conduct a systematic subjective study to collect multidimensional annotations, including perceptual quality scores, text-image correspondence scores, visible and distorted body part labels. Based on AGHI-QA, we evaluate the strengths and weaknesses of current T2I methods in generating human images from multiple dimensions. Furthermore, we propose AGHI-Assessor, a novel quality metric that integrates the large multimodal model (LMM) with domain-specific human features for precise quality prediction and identification of visible and distorted body parts in AGHIs. Extensive experimental results demonstrate that AGHI-Assessor showcases state-of-the-art performance, significantly outperforming existing IQA methods in multidimensional quality assessment and surpassing leading LMMs in detecting structural distortions in AGHIs.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs
Authors:
Woo Yi Yang,
Jiarui Wang,
Sijing Wu,
Huiyu Duan,
Yuxin Zhu,
Liu Yang,
Kang Fu,
Guangtao Zhai,
Xiongkuo Min
Abstract:
The rapid advancement in generative artificial intelligence have enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development, etc. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate percept…
▽ More
The rapid advancement in generative artificial intelligence have enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development, etc. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate perceptual sensitivity to facial features. To this end, we conduct a comprehensive study on the quality assessment of AI-generated 3D human faces. We first introduce Gen3DHF, a large-scale benchmark comprising 2,000 videos of AI-Generated 3D Human Faces along with 4,000 Mean Opinion Scores (MOS) collected across two dimensions, i.e., quality and authenticity, 2,000 distortion-aware saliency maps and distortion descriptions. Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. Experimental results show that LMME3DHF achieves state-of-the-art performance, surpassing existing methods in both accurately predicting quality scores for AI-generated 3D human faces and effectively identifying distortion-aware salient regions and distortion types, while maintaining strong alignment with human perceptual judgments. Both the Gen3DHF database and the LMME3DHF will be released upon the publication.
△ Less
Submitted 5 May, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models
Authors:
Gengxian Cao,
Fengyuan Li,
Hong Duan,
Ye Yang,
Bofeng Wang,
Donghe Li
Abstract:
This paper introduces a novel multi-Agent framework that automates the end to end production of Qinqiang opera by integrating Large Language Models , visual generation, and Text to Speech synthesis. Three specialized agents collaborate in sequence: Agent1 uses an LLM to craft coherent, culturally grounded scripts;Agent2 employs visual generation models to render contextually accurate stage scenes;…
▽ More
This paper introduces a novel multi-Agent framework that automates the end to end production of Qinqiang opera by integrating Large Language Models , visual generation, and Text to Speech synthesis. Three specialized agents collaborate in sequence: Agent1 uses an LLM to craft coherent, culturally grounded scripts;Agent2 employs visual generation models to render contextually accurate stage scenes; and Agent3 leverages TTS to produce synchronized, emotionally expressive vocal performances. In a case study on Dou E Yuan, the system achieved expert ratings of 3.8 for script fidelity, 3.5 for visual coherence, and 3.8 for speech accuracy-culminating in an overall score of 3.6, a 0.3 point improvement over a Single Agent baseline. Ablation experiments demonstrate that removing Agent2 or Agent3 leads to drops of 0.4 and 0.5 points, respectively, underscoring the value of modular collaboration. This work showcases how AI driven pipelines can streamline and scale the preservation of traditional performing arts, and points toward future enhancements in cross modal alignment, richer emotional nuance, and support for additional opera genres.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Omni$^2$: Unifying Omnidirectional Image Generation and Editing in an Omni Model
Authors:
Liu Yang,
Huiyu Duan,
Yucheng Zhu,
Xiaohong Liu,
Lu Liu,
Zitong Xu,
Guangji Ma,
Xiongkuo Min,
Guangtao Zhai,
Patrick Le Callet
Abstract:
$360^{\circ}…
▽ More
$360^{\circ}$ omnidirectional images (ODIs) have gained considerable attention recently, and are widely used in various virtual reality (VR) and augmented reality (AR) applications. However, capturing such images is expensive and requires specialized equipment, making ODI synthesis increasingly important. While common 2D image generation and editing methods are rapidly advancing, these models struggle to deliver satisfactory results when generating or editing ODIs due to the unique format and broad 360$^{\circ}$ Field-of-View (FoV) of ODIs. To bridge this gap, we construct \textbf{\textit{Any2Omni}}, the first comprehensive ODI generation-editing dataset comprises 60,000+ training data covering diverse input conditions and up to 9 ODI generation and editing tasks. Built upon Any2Omni, we propose an \textbf{\underline{Omni}} model for \textbf{\underline{Omni}}-directional image generation and editing (\textbf{\textit{Omni$^2$}}), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model. Extensive experiments demonstrate the superiority and effectiveness of the proposed Omni$^2$ model for both the ODI generation and editing tasks.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation
Authors:
Jingkun Chen,
Haoran Duan,
Xiao Zhang,
Boyan Gao,
Tao Tan,
Vicente Grau,
Jungong Han
Abstract:
Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision require…
▽ More
Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively-improving 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: https://github.com/jingkunchen/FGI.git.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment
Authors:
Sijing Wu,
Yunhao Li,
Ziwen Xu,
Yixuan Gao,
Huiyu Duan,
Wei Sun,
Guangtao Zhai
Abstract:
Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA…
▽ More
Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face video, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we elaborately extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
Authors:
Jiarui Wang,
Huiyu Duan,
Yu Zhao,
Juntong Wang,
Guangtao Zhai,
Xiongkuo Min
Abstract:
Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. T…
▽ More
Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions including perception, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K, and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, manifesting the generality of both the EvalMi-50K dataset and LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released at https://github.com/IntMeGroup/LMM4LMM.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
MM-IFEngine: Towards Multimodal Instruction Following
Authors:
Shengyuan Ding,
Shenxi Wu,
Xiangyu Zhao,
Yuhang Zang,
Haodong Duan,
Xiaoyi Dong,
Pan Zhang,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
Abstract:
The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To addre…
▽ More
The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2$\%$), MIA (+7.6$\%$), and IFEval (+12.3$\%$). We have fully open-sourced the datasets (both SFT and DPO), evaluation code and training scripts at https://github.com/SYuan03/MM-IFEngine.
△ Less
Submitted 27 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking
Authors:
Qi Liu,
Haozhe Duan,
Yiqun Chen,
Quanfeng Lu,
Weiwei Sun,
Jiaxin Mao
Abstract:
Utilizing large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, many studies are dedicated to improving the performance and efficiency of using LLMs for reranking. Besides, it can also be applied in many real-world applications, such as search engines or retrieval-augmented generation. In response to the growing demand for research…
▽ More
Utilizing large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, many studies are dedicated to improving the performance and efficiency of using LLMs for reranking. Besides, it can also be applied in many real-world applications, such as search engines or retrieval-augmented generation. In response to the growing demand for research and application in practice, we introduce a unified framework, \textbf{LLM4Ranking}, which enables users to adopt different ranking methods using open-source or closed-source API-based LLMs. Our framework provides a simple and extensible interface for document reranking with LLMs, as well as easy-to-use evaluation and fine-tuning scripts for this task. We conducted experiments based on this framework and evaluated various models and methods on several widely used datasets, providing reproducibility results on utilizing LLMs for document reranking. Our code is publicly available at https://github.com/liuqi6777/llm4ranking.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
Attention in Diffusion Model: A Survey
Authors:
Litao Hua,
Fan Liu,
Jie Su,
Xingyu Miao,
Zizhou Ouyang,
Zeyu Wang,
Runze Hu,
Zhenyu Wen,
Bing Zhai,
Yang Long,
Haoran Duan,
Yuan Zhou
Abstract:
Attention mechanisms have become a foundational component in diffusion models, significantly influencing their capacity across a wide range of generative and discriminative tasks. This paper presents a comprehensive survey of attention within diffusion models, systematically analysing its roles, design patterns, and operations across different modalities and tasks. We propose a unified taxonomy th…
▽ More
Attention mechanisms have become a foundational component in diffusion models, significantly influencing their capacity across a wide range of generative and discriminative tasks. This paper presents a comprehensive survey of attention within diffusion models, systematically analysing its roles, design patterns, and operations across different modalities and tasks. We propose a unified taxonomy that categorises attention-related modifications into parts according to the structural components they affect, offering a clear lens through which to understand their functional diversity. In addition to reviewing architectural innovations, we examine how attention mechanisms contribute to performance improvements in diverse applications. We also identify current limitations and underexplored areas, and outline potential directions for future research. Our study provides valuable insights into the evolving landscape of diffusion models, with a particular focus on the integrative and ubiquitous role of attention.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Authors:
Xiangyu Zhao,
Peiyuan Zhang,
Kexian Tang,
Hao Li,
Zicheng Zhang,
Guangtao Zhai,
Junchi Yan,
Hua Yang,
Xue Yang,
Haodong Duan
Abstract:
Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing…
▽ More
Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.
△ Less
Submitted 8 April, 2025; v1 submitted 3 April, 2025;
originally announced April 2025.
-
ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation
Authors:
Yuan Zhou,
Shilong Jin,
Litao Hua,
Wanjun Lv,
Haoran Duan,
Jungong Han
Abstract:
Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent…
▽ More
Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel framework that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise camera parameters; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation, outperforming existing methods in both visual quality and consistency.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
WorldScore: A Unified Evaluation Benchmark for World Generation
Authors:
Haoyi Duan,
Hong-Xing Yu,
Sirui Chen,
Li Fei-Fei,
Jiajun Wu
Abstract:
We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000…
▽ More
We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. The WorldScore metrics evaluate generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source ones, we reveal key insights and challenges for each category of models. Our dataset, evaluation code, and leaderboard can be found at https://haoyi-duan.github.io/WorldScore/
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Authors:
Kexian Tang,
Junyao Gao,
Yanhong Zeng,
Haodong Duan,
Yanan Sun,
Zhening Xing,
Wenran Liu,
Kaifeng Lyu,
Kai Chen
Abstract:
Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce \textbf{L…
▽ More
Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce \textbf{LEGO-Puzzles}, a scalable benchmark designed to evaluate both \textbf{spatial understanding} and \textbf{sequential reasoning} in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90\% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
Authors:
Zhi Hou,
Tianyi Zhang,
Yuwen Xiong,
Haonan Duan,
Hengjun Pu,
Ronglei Tong,
Chengyang Zhao,
Xizhou Zhu,
Yu Qiao,
Jifeng Dai,
Yuntao Chen
Abstract:
While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuou…
▽ More
While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer's scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. Such synergy enhances robustness against various variances and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparative performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
FMDConv: Fast Multi-Attention Dynamic Convolution via Speed-Accuracy Trade-off
Authors:
Tianyu Zhang,
Fan Wan,
Haoran Duan,
Kevin W. Tong,
Jingjing Deng,
Yang Long
Abstract:
Spatial convolution is fundamental in constructing deep Convolutional Neural Networks (CNNs) for visual recognition. While dynamic convolution enhances model accuracy by adaptively combining static kernels, it incurs significant computational overhead, limiting its deployment in resource-constrained environments such as federated edge computing. To address this, we propose Fast Multi-Attention Dyn…
▽ More
Spatial convolution is fundamental in constructing deep Convolutional Neural Networks (CNNs) for visual recognition. While dynamic convolution enhances model accuracy by adaptively combining static kernels, it incurs significant computational overhead, limiting its deployment in resource-constrained environments such as federated edge computing. To address this, we propose Fast Multi-Attention Dynamic Convolution (FMDConv), which integrates input attention, temperature-degraded kernel attention, and output attention to optimize the speed-accuracy trade-off. FMDConv achieves a better balance between accuracy and efficiency by selectively enhancing feature extraction with lower complexity. Furthermore, we introduce two novel quantitative metrics, the Inverse Efficiency Score and Rate-Correct Score, to systematically evaluate this trade-off. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that FMDConv reduces the computational cost by up to 49.8\% on ResNet-18 and 42.2\% on ResNet-50 compared to prior multi-attention dynamic convolution methods while maintaining competitive accuracy. These advantages make FMDConv highly suitable for real-world, resource-constrained applications.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
D2Fusion: Dual-domain Fusion with Feature Superposition for Deepfake Detection
Authors:
Xueqi Qiu,
Xingyu Miao,
Fan Wan,
Haoran Duan,
Tejal Shah,
Varun Ojhab,
Yang Longa,
Rajiv Ranjan
Abstract:
Deepfake detection is crucial for curbing the harm it causes to society. However, current Deepfake detection methods fail to thoroughly explore artifact information across different domains due to insufficient intrinsic interactions. These interactions refer to the fusion and coordination after feature extraction processes across different domains, which are crucial for recognizing complex forgery…
▽ More
Deepfake detection is crucial for curbing the harm it causes to society. However, current Deepfake detection methods fail to thoroughly explore artifact information across different domains due to insufficient intrinsic interactions. These interactions refer to the fusion and coordination after feature extraction processes across different domains, which are crucial for recognizing complex forgery clues. Focusing on more generalized Deepfake detection, in this work, we introduce a novel bi-directional attention module to capture the local positional information of artifact clues from the spatial domain. This enables accurate artifact localization, thus addressing the coarse processing with artifact features. To further address the limitation that the proposed bi-directional attention module may not well capture global subtle forgery information in the artifact feature (e.g., textures or edges), we employ a fine-grained frequency attention module in the frequency domain. By doing so, we can obtain high-frequency information in the fine-grained features, which contains the global and subtle forgery information. Although these features from the diverse domains can be effectively and independently improved, fusing them directly does not effectively improve the detection performance. Therefore, we propose a feature superposition strategy that complements information from spatial and frequency domains. This strategy turns the feature components into the form of wave-like tokens, which are updated based on their phase, such that the distinctions between authentic and artifact features can be amplified. Our method demonstrates significant improvements over state-of-the-art (SOTA) methods on five public Deepfake datasets in capturing abnormalities across different manipulated operations and real-life.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
FedSCA: Federated Tuning with Similarity-guided Collaborative Aggregation for Heterogeneous Medical Image Segmentation
Authors:
Yumin Zhang,
Yan Gao,
Haoran Duan,
Hanqing Guo,
Tejal Shah,
Rajiv Ranjan,
Bo Wei
Abstract:
Transformer-based foundation models (FMs) have recently demonstrated remarkable performance in medical image segmentation. However, scaling these models is challenging due to the limited size of medical image datasets within isolated hospitals, where data centralization is restricted due to privacy concerns. These constraints, combined with the data-intensive nature of FMs, hinder their broader ap…
▽ More
Transformer-based foundation models (FMs) have recently demonstrated remarkable performance in medical image segmentation. However, scaling these models is challenging due to the limited size of medical image datasets within isolated hospitals, where data centralization is restricted due to privacy concerns. These constraints, combined with the data-intensive nature of FMs, hinder their broader application. Integrating federated learning (FL) with foundation models (FLFM) fine-tuning offers a potential solution to these challenges by enabling collaborative model training without data sharing, thus allowing FMs to take advantage of a diverse pool of sensitive medical image data across hospitals/clients. However, non-independent and identically distributed (non-IID) data among clients, paired with computational and communication constraints in federated environments, presents an additional challenge that limits further performance improvements and remains inadequately addressed in existing studies. In this work, we propose a novel FLFM fine-tuning framework, \underline{\textbf{Fed}}erated tuning with \underline{\textbf{S}}imilarity-guided \underline{\textbf{C}}ollaborative \underline{\textbf{A}}ggregation (FedSCA), encompassing all phases of the FL process. This includes (1) specially designed parameter-efficient fine-tuning (PEFT) for local client training to enhance computational efficiency; (2) partial low-level adapter transmission for communication efficiency; and (3) similarity-guided collaborative aggregation (SGCA) on the server side to address non-IID issues. Extensive experiments on three FL benchmarks for medical image segmentation demonstrate the effectiveness of our proposed FedSCA, establishing new SOTA performance.
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Authors:
Xinyu Fang,
Zhijian Chen,
Kai Lan,
Lixin Ma,
Shengyuan Ding,
Yingji Liang,
Xiangyu Zhao,
Farong Wen,
Zicheng Zhang,
Guofeng Zhang,
Haodong Duan,
Kai Chen,
Dahua Lin
Abstract:
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a m…
▽ More
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code is released on https://github.com/open-compass/Creation-MMBench.
△ Less
Submitted 19 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Rapfi: Distilling Efficient Neural Network for the Game of Gomoku
Authors:
Zhanggen Jin,
Haobin Duan,
Zhiyang Hang
Abstract:
Games have played a pivotal role in advancing artificial intelligence, with AI agents using sophisticated techniques to compete. Despite the success of neural network based game AIs, their performance often requires significant computational resources. In this paper, we present Rapfi, an efficient Gomoku agent that outperforms CNN-based agents in limited computation environments. Rapfi leverages a…
▽ More
Games have played a pivotal role in advancing artificial intelligence, with AI agents using sophisticated techniques to compete. Despite the success of neural network based game AIs, their performance often requires significant computational resources. In this paper, we present Rapfi, an efficient Gomoku agent that outperforms CNN-based agents in limited computation environments. Rapfi leverages a compact neural network with a pattern-based codebook distilled from CNNs, and an incremental update scheme that minimizes computation when input changes are minor. This new network uses computation that is orders of magnitude less to reach a similar accuracy of much larger neural networks such as Resnet. Thanks to our incremental update scheme, depth-first search methods such as the alpha-beta search can be significantly accelerated. With a carefully tuned evaluation and search, Rapfi reached strength surpassing Katagomo, the strongest open-source Gomoku AI based on AlphaZero's algorithm, under limited computational resources where accelerators like GPUs are absent. Rapfi ranked first among 520 Gomoku agents on Botzone and won the championship in GomoCup 2024.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Authors:
Weiyun Wang,
Zhangwei Gao,
Lianjie Chen,
Zhe Chen,
Jinguo Zhu,
Xiangyu Zhao,
Yangzhou Liu,
Yue Cao,
Shenglong Ye,
Xizhou Zhu,
Lewei Lu,
Haodong Duan,
Yu Qiao,
Jifeng Dai,
Wenhai Wang
Abstract:
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when a…
▽ More
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
Information Density Principle for MLLM Benchmarks
Authors:
Chunyi Li,
Xiaozhe Li,
Zicheng Zhang,
Yuan Tian,
Ziheng Jia,
Xiaohong Liu,
Xiongkuo Min,
Jia Wang,
Haodong Duan,
Kai Chen,
Guangtao Zhai
Abstract:
With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Info…
▽ More
With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Information Density, which examines how much insight a benchmark can provide for the development of MLLMs. We characterize it from four key dimensions: (1) Fallacy, (2) Difficulty, (3) Redundancy, (4) Diversity. Through a comprehensive analysis of more than 10,000 samples, we measured the information density of 19 MLLM benchmarks. Experiments show that using the latest benchmarks in testing can provide more insight compared to previous ones, but there is still room for improvement in their information density. We hope this principle can promote the development and application of future MLLM benchmarks. Project page: https://github.com/lcysyzxdxc/bench4bench
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
Image Quality Assessment: From Human to Machine Preference
Authors:
Chunyi Li,
Yuan Tian,
Xiaoyue Ling,
Zicheng Zhang,
Haodong Duan,
Haoning Wu,
Ziheng Jia,
Xiaohong Liu,
Xiongkuo Min,
Guo Lu,
Weisi Lin,
Guangtao Zhai
Abstract:
Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering…
▽ More
Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering the huge gap between human and machine visual systems, this paper proposes the topic: Image Quality Assessment for Machine Vision for the first time. Specifically, we (1) defined the subjective preferences of machines, including downstream tasks, test models, and evaluation metrics; (2) established the Machine Preference Database (MPD), which contains 2.25M fine-grained annotations and 30k reference/distorted image pair instances; (3) verified the performance of mainstream IQA algorithms on MPD. Experiments show that current IQA metrics are human-centric and cannot accurately characterize machine preferences. We sincerely hope that MPD can promote the evolution of IQA from human to machine preferences. Project page is on: https://github.com/lcysyzxdxc/MPD.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
Predicting Practically? Domain Generalization for Predictive Analytics in Real-world Environments
Authors:
Hanyu Duan,
Yi Yang,
Ahmed Abbasi,
Kar Yan Tam
Abstract:
Predictive machine learning models are widely used in customer relationship management (CRM) to forecast customer behaviors and support decision-making. However, the dynamic nature of customer behaviors often results in significant distribution shifts between training data and serving data, leading to performance degradation in predictive models. Domain generalization, which aims to train models t…
▽ More
Predictive machine learning models are widely used in customer relationship management (CRM) to forecast customer behaviors and support decision-making. However, the dynamic nature of customer behaviors often results in significant distribution shifts between training data and serving data, leading to performance degradation in predictive models. Domain generalization, which aims to train models that can generalize to unseen environments without prior knowledge of their distributions, has become a critical area of research. In this work, we propose a novel domain generalization method tailored to handle complex distribution shifts, encompassing both covariate and concept shifts. Our method builds upon the Distributionally Robust Optimization framework, optimizing model performance over a set of hypothetical worst-case distributions rather than relying solely on the training data. Through simulation experiments, we demonstrate the working mechanism of the proposed method. We also conduct experiments on a real-world customer churn dataset, and validate its effectiveness in both temporal and spatial generalization settings. Finally, we discuss the broader implications of our method for advancing Information Systems (IS) design research, particularly in building robust predictive models for dynamic managerial environments.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
Visual-RFT: Visual Reinforcement Fine-Tuning
Authors:
Ziyu Liu,
Zeyi Sun,
Yuhang Zang,
Xiaoyi Dong,
Yuhang Cao,
Haodong Duan,
Dahua Lin,
Jiaqi Wang
Abstract:
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models,…
▽ More
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Asynchronous Personalized Federated Learning through Global Memorization
Authors:
Fan Wan,
Yuchen Li,
Xueqi Qiu,
Rui Sun,
Leyuan Zhang,
Xingyu Miao,
Tianyu Zhang,
Haoran Duan,
Yang Long
Abstract:
The proliferation of Internet of Things devices and advances in communication technology have unleashed an explosion of personal data, amplifying privacy concerns amid stringent regulations like GDPR and CCPA. Federated Learning offers a privacy preserving solution by enabling collaborative model training across decentralized devices without centralizing sensitive data. However, statistical hetero…
▽ More
The proliferation of Internet of Things devices and advances in communication technology have unleashed an explosion of personal data, amplifying privacy concerns amid stringent regulations like GDPR and CCPA. Federated Learning offers a privacy preserving solution by enabling collaborative model training across decentralized devices without centralizing sensitive data. However, statistical heterogeneity from non-independent and identically distributed datasets and system heterogeneity due to client dropouts particularly those with monopolistic classes severely degrade the global model's performance. To address these challenges, we propose the Asynchronous Personalized Federated Learning framework, which empowers clients to develop personalized models using a server side semantic generator. This generator, trained via data free knowledge transfer under global model supervision, enhances client data diversity by producing both seen and unseen samples, the latter enabled by Zero-Shot Learning to mitigate dropout-induced data loss. To counter the risks of synthetic data impairing training, we introduce a decoupled model interpolation method, ensuring robust personalization. Extensive experiments demonstrate that AP FL significantly outperforms state of the art FL methods in tackling non-IID distributions and client dropouts, achieving superior accuracy and resilience across diverse real-world scenarios.
△ Less
Submitted 1 March, 2025;
originally announced March 2025.
-
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Authors:
Xiangyu Zhao,
Shengyuan Ding,
Zicheng Zhang,
Haian Huang,
Maosong Cao,
Weiyun Wang,
Jiaqi Wang,
Xinyu Fang,
Wenhai Wang,
Guangtao Zhai,
Haodong Duan,
Hua Yang,
Kai Chen
Abstract:
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with…
▽ More
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs' alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.
△ Less
Submitted 28 February, 2025; v1 submitted 25 February, 2025;
originally announced February 2025.
-
Multi-Dimensional Quality Assessment for Text-to-3D Assets: Dataset and Model
Authors:
Kang Fu,
Huiyu Duan,
Zicheng Zhang,
Xiaohong Liu,
Xiongkuo Min,
Jia Wang,
Guangtao Zhai
Abstract:
Recent advancements in text-to-image (T2I) generation have spurred the development of text-to-3D asset (T23DA) generation, leveraging pretrained 2D text-to-image diffusion models for text-to-3D asset synthesis. Despite the growing popularity of text-to-3D asset generation, its evaluation has not been well considered and studied. However, given the significant quality discrepancies among various te…
▽ More
Recent advancements in text-to-image (T2I) generation have spurred the development of text-to-3D asset (T23DA) generation, leveraging pretrained 2D text-to-image diffusion models for text-to-3D asset synthesis. Despite the growing popularity of text-to-3D asset generation, its evaluation has not been well considered and studied. However, given the significant quality discrepancies among various text-to-3D assets, there is a pressing need for quality assessment models aligned with human subjective judgments. To tackle this challenge, we conduct a comprehensive study to explore the T23DA quality assessment (T23DAQA) problem in this work from both subjective and objective perspectives. Given the absence of corresponding databases, we first establish the largest text-to-3D asset quality assessment database to date, termed the AIGC-T23DAQA database. This database encompasses 969 validated 3D assets generated from 170 prompts via 6 popular text-to-3D asset generation models, and corresponding subjective quality ratings for these assets from the perspectives of quality, authenticity, and text-asset correspondence, respectively. Subsequently, we establish a comprehensive benchmark based on the AIGC-T23DAQA database, and devise an effective T23DAQA model to evaluate the generated 3D assets from the aforementioned three perspectives, respectively.
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
Adaptive Multi-Objective Bayesian Optimization for Capacity Planning of Hybrid Heat Sources in Electric-Heat Coupling Systems of Cold Regions
Authors:
Ruizhe Yang,
Zhongkai Yi,
Ying Xu,
Guiyu Chen,
Haojie Yang,
Rong Yi,
Tongqing Li,
Miaozhe ShenJin Li,
Haoxiang Gao,
Hongyu Duan
Abstract:
The traditional heat-load generation pattern of combined heat and power generators has become a problem leading to renewable energy source (RES) power curtailment in cold regions, motivating the proposal of a planning model for alternative heat sources. The model aims to identify non-dominant capacity allocation schemes for heat pumps, thermal energy storage, electric boilers, and combined storage…
▽ More
The traditional heat-load generation pattern of combined heat and power generators has become a problem leading to renewable energy source (RES) power curtailment in cold regions, motivating the proposal of a planning model for alternative heat sources. The model aims to identify non-dominant capacity allocation schemes for heat pumps, thermal energy storage, electric boilers, and combined storage heaters to construct a Pareto front, considering both economic and sustainable objectives. The integration of various heat sources from both generation and consumption sides enhances flexibility in utilization. The study introduces a novel optimization algorithm, the adaptive multi-objective Bayesian optimization (AMBO). Compared to other widely used multi-objective optimization algorithms, AMBO eliminates predefined parameters that may introduce subjectivity from planners. Beyond the algorithm, the proposed model incorporates a noise term to account for inevitable simulation deviations, enabling the identification of better-performing planning results that meet the unique requirements of cold regions. What's more, the characteristics of electric-thermal coupling scenarios are captured and reflected in the operation simulation model to make sure the simulation is close to reality. Numerical simulation verifies the superiority of the proposed approach in generating a more diverse and evenly distributed Pareto front in a sample-efficient manner, providing comprehensive and objective planning choices.
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
Joint Modelling Histology and Molecular Markers for Cancer Classification
Authors:
Xiaofei Wang,
Hanyu Liu,
Yupei Zhang,
Boyang Zhao,
Hao Duan,
Wanming Hu,
Yonggao Mou,
Stephen Price,
Chao Li
Abstract:
Cancers are characterized by remarkable heterogeneity and diverse prognosis. Accurate cancer classification is essential for patient stratification and clinical decision-making. Although digital pathology has been advancing cancer diagnosis and prognosis, the paradigm in cancer pathology has shifted from purely relying on histology features to incorporating molecular markers. There is an urgent ne…
▽ More
Cancers are characterized by remarkable heterogeneity and diverse prognosis. Accurate cancer classification is essential for patient stratification and clinical decision-making. Although digital pathology has been advancing cancer diagnosis and prognosis, the paradigm in cancer pathology has shifted from purely relying on histology features to incorporating molecular markers. There is an urgent need for digital pathology methods to meet the needs of the new paradigm. We introduce a novel digital pathology approach to jointly predict molecular markers and histology features and model their interactions for cancer classification. Firstly, to mitigate the challenge of cross-magnification information propagation, we propose a multi-scale disentangling module, enabling the extraction of multi-scale features from high-magnification (cellular-level) to low-magnification (tissue-level) whole slide images. Further, based on the multi-scale features, we propose an attention-based hierarchical multi-task multi-instance learning framework to simultaneously predict histology and molecular markers. Moreover, we propose a co-occurrence probability-based label correlation graph network to model the co-occurrence of molecular markers. Lastly, we design a cross-modal interaction module with the dynamic confidence constrain loss and a cross-modal gradient modulation strategy, to model the interactions of histology and molecular markers. Our experiments demonstrate that our method outperforms other state-of-the-art methods in classifying glioma, histology features and molecular markers. Our method promises to promote precise oncology with the potential to advance biomedical research and clinical applications. The code is available at https://github.com/LHY1007/M3C2
△ Less
Submitted 11 February, 2025;
originally announced February 2025.
-
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Authors:
Xilin Wei,
Xiaoran Liu,
Yuhang Zang,
Xiaoyi Dong,
Pan Zhang,
Yuhang Cao,
Jian Tong,
Haodong Duan,
Qipeng Guo,
Jiaqi Wang,
Xipeng Qiu,
Dahua Lin
Abstract:
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully co…
▽ More
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce \textbf{VideoRoPE}, with a \textit{3D structure} designed to preserve spatio-temporal relationships. VideoRoPE features \textit{low-frequency temporal allocation} to mitigate periodic oscillations, a \textit{diagonal layout} to maintain spatial symmetry, and \textit{adjustable temporal spacing} to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at \href{https://github.com/Wiselnn570/VideoRoPE}{https://github.com/Wiselnn570/VideoRoPE}.
△ Less
Submitted 27 April, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields
Authors:
Xingyu Miao,
Haoran Duan,
Yang Bai,
Tejal Shah,
Jun Song,
Yang Long,
Rajiv Ranjan,
Ling Shao
Abstract:
In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segme…
▽ More
In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segmentation of 3D scenes using text. To achieve this, we introduce an adapter module and mitigate the noise issue in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to enhance the accuracy of segmentation edges, this work presents a low-rank transient query attention mechanism. To ensure the consistency of segmentation for similar colors under different viewpoints, we convert the segmentation task into a classification task through label volume, which significantly improves the consistency of segmentation in color-similar areas. We also propose a simplified text augmentation strategy to alleviate the issue of ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art technologies in both training speed and performance. Our code is available on: https://github.com/xingy038/Laser.git.
△ Less
Submitted 31 January, 2025;
originally announced January 2025.
-
Redundancy Principles for MLLMs Benchmarks
Authors:
Zicheng Zhang,
Xiangyu Zhao,
Xinyu Fang,
Chunyi Li,
Xiaohong Liu,
Xiongkuo Min,
Haodong Duan,
Kai Chen,
Guangtao Zhai
Abstract:
With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for…
▽ More
With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through the comprehensive analysis over hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy lies in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.
△ Less
Submitted 20 January, 2025;
originally announced January 2025.
-
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Authors:
Yuhang Zang,
Xiaoyi Dong,
Pan Zhang,
Yuhang Cao,
Ziyu Liu,
Shengyuan Ding,
Shenxi Wu,
Yubo Ma,
Haodong Duan,
Wenwei Zhang,
Kai Chen,
Dahua Lin,
Jiaqi Wang
Abstract:
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary…
▽ More
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. We integrate IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer
△ Less
Submitted 21 January, 2025;
originally announced January 2025.
-
Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
Authors:
Maosong Cao,
Taolin Zhang,
Mo Li,
Chuyu Zhang,
Yunxin Liu,
Haodong Duan,
Songyang Zhang,
Kai Chen
Abstract:
The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage syn…
▽ More
The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling for synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
△ Less
Submitted 21 January, 2025;
originally announced January 2025.
-
You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense
Authors:
Wuyuao Mai,
Geng Hong,
Pei Chen,
Xudong Pan,
Baojun Liu,
Yuan Zhang,
Haixin Duan,
Min Yang
Abstract:
With the rise of generative large language models (LLMs) like LLaMA and ChatGPT, these models have significantly transformed daily life and work by providing advanced insights. However, as jailbreak attacks continue to circumvent built-in safety mechanisms, exploiting carefully crafted scenarios or tokens, the safety risks of LLMs have come into focus. While numerous defense strategies--such as pr…
▽ More
With the rise of generative large language models (LLMs) like LLaMA and ChatGPT, these models have significantly transformed daily life and work by providing advanced insights. However, as jailbreak attacks continue to circumvent built-in safety mechanisms, exploiting carefully crafted scenarios or tokens, the safety risks of LLMs have come into focus. While numerous defense strategies--such as prompt detection, modification, and model fine-tuning--have been proposed to counter these attacks, a critical question arises: do these defenses compromise the utility and usability of LLMs for legitimate users? Existing research predominantly focuses on the effectiveness of defense strategies without thoroughly examining their impact on performance, leaving a gap in understanding the trade-offs between LLM safety and performance. Our research addresses this gap by conducting a comprehensive study on the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs with jailbreak defense strategies. We propose USEBench, a novel benchmark designed to evaluate these aspects, along with USEIndex, a comprehensive metric for assessing overall model performance. Through experiments on seven state-of-the-art LLMs, we found that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously. Although model-finetuning performs the best overall, their effectiveness varies across LLMs. Furthermore, vertical comparisons reveal that developers commonly prioritize performance over safety when iterating or fine-tuning their LLMs.
△ Less
Submitted 21 January, 2025;
originally announced January 2025.
-
SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors
Authors:
Zhen Hong,
Bowen Wang,
Haoran Duan,
Yawen Huang,
Xiong Li,
Zhenyu Wen,
Xiang Wu,
Wei Xiang,
Yefeng Zheng
Abstract:
Neural implicit representations have recently shown promising progress in dense Simultaneous Localization And Mapping (SLAM). However, existing works have shortcomings in terms of reconstruction quality and real-time performance, mainly due to inflexible scene representation strategy without leveraging any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system tha…
▽ More
Neural implicit representations have recently shown promising progress in dense Simultaneous Localization And Mapping (SLAM). However, existing works have shortcomings in terms of reconstruction quality and real-time performance, mainly due to inflexible scene representation strategy without leveraging any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system that performs tracking and mapping in real-time. SP-SLAM computes depth images and establishes sparse voxel-encoded scene priors near the surfaces to achieve rapid convergence of the model. Subsequently, the encoding voxels computed from single-frame depth image are fused into a global volume, which facilitates high-fidelity surface reconstruction. Simultaneously, we employ tri-planes to store scene appearance information, striking a balance between achieving high-quality geometric texture mapping and minimizing memory consumption. Furthermore, in SP-SLAM, we introduce an effective optimization strategy for mapping, allowing the system to continuously optimize the poses of all historical input frames during runtime without increasing computational overhead. We conduct extensive evaluations on five benchmark datasets (Replica, ScanNet, TUM RGB-D, Synthetic RGB-D, 7-Scenes). The results demonstrate that, compared to existing methods, we achieve superior tracking accuracy and reconstruction quality, while running at a significantly faster speed.
△ Less
Submitted 11 January, 2025;
originally announced January 2025.
-
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Authors:
Yifei Li,
Junbo Niu,
Ziyang Miao,
Chunjiang Ge,
Yuanhang Zhou,
Qihao He,
Xiaoyi Dong,
Haodong Duan,
Shuangrui Ding,
Rui Qian,
Pan Zhang,
Yuhang Zang,
Yuhang Cao,
Conghui He,
Jiaqi Wang
Abstract:
Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite…
▽ More
Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately human-curated 2,800 fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.
△ Less
Submitted 27 March, 2025; v1 submitted 9 January, 2025;
originally announced January 2025.
-
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Authors:
Beichen Zhang,
Yuhong Liu,
Xiaoyi Dong,
Yuhang Zang,
Pan Zhang,
Haodong Duan,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
Abstract:
Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematic…
▽ More
Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematical problems, they often struggle with reasoning errors in fine-grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details. To address this issue, we propose BoostStep, a method that enhances reasoning accuracy through step-aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first-try" strategy to deliver exemplars highly relevant to the current state of reasoning. BoostStep is a flexible and powerful method that integrates seamlessly with chain-of-thought (CoT) and tree search algorithms, refining both candidate selection and decision-making. Empirical results show that BoostStep improves GPT-4o's CoT performance by 4.6% across mathematical benchmarks, significantly surpassing traditional few-shot learning's 1.2%. Moreover, it can achieve an additional 7.5\% gain combined with tree search. Surprisingly, it enhances state-of-the-art LLMs to solve challenging math problems using simpler examples. It improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.
△ Less
Submitted 17 February, 2025; v1 submitted 6 January, 2025;
originally announced January 2025.
-
Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method
Authors:
Hui Li,
Xiaoyu Ren,
Hongjiu Yu,
Huiyu Duan,
Kai Li,
Ying Chen,
Libo Wang,
Xiongkuo Min,
Guangtao Zhai,
Xu Liu
Abstract:
Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live streaming for facial retouching, content recommendation, etc. However, previous FAP datasets are either small, closed-source, or lack diversity. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, in t…
▽ More
Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live streaming for facial retouching, content recommendation, etc. However, previous FAP datasets are either small, closed-source, or lack diversity. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, in this paper we present LiveBeauty, the first large-scale live-specific FAP dataset, in a more challenging application scenario, i.e., live streaming. 10,000 face images are collected from a live streaming platform directly, with 200,000 corresponding attractiveness annotations obtained from a well-devised subjective experiment, making LiveBeauty the largest open-access FAP dataset in the challenging live scenario. Furthermore, a multi-modal FAP method is proposed to measure the facial attractiveness in live streaming. Specifically, we first extract holistic facial prior knowledge and multi-modal aesthetic semantic features via a Personalized Attractiveness Prior Module (PAPM) and a Multi-modal Attractiveness Encoder Module (MAEM), respectively, then integrate the extracted features through a Cross-Modal Fusion Module (CMFM). Extensive experiments conducted on both LiveBeauty and other open-source FAP datasets demonstrate that our proposed method achieves state-of-the-art performance. Dataset will be available soon.
△ Less
Submitted 12 March, 2025; v1 submitted 5 January, 2025;
originally announced January 2025.
-
HarmonyIQA: Pioneering Benchmark and Model for Image Harmonization Quality Assessment
Authors:
Zitong Xu,
Huiyu Duan,
Guangji Ma,
Liu Yang,
Jiarui Wang,
Qingbo Wu,
Xiongkuo Min,
Guangtao Zhai,
Patrick Le Callet
Abstract:
Image composition involves extracting a foreground object from one image and pasting it into another image through Image harmonization algorithms (IHAs), which aim to adjust the appearance of the foreground object to better match the background. Existing image quality assessment (IQA) methods may fail to align with human visual preference on image harmonization due to the insensitivity to minor co…
▽ More
Image composition involves extracting a foreground object from one image and pasting it into another image through Image harmonization algorithms (IHAs), which aim to adjust the appearance of the foreground object to better match the background. Existing image quality assessment (IQA) methods may fail to align with human visual preference on image harmonization due to the insensitivity to minor color or light inconsistency. To address the issue and facilitate the advancement of IHAs, we introduce the first Image Quality Assessment Database for image Harmony evaluation (HarmonyIQAD), which consists of 1,350 harmonized images generated by 9 different IHAs, and the corresponding human visual preference scores. Based on this database, we propose a Harmony Image Quality Assessment (HarmonyIQA), to predict human visual preference for harmonized images. Extensive experiments show that HarmonyIQA achieves state-of-the-art performance on human visual preference evaluation for harmonized images, and also achieves competing results on traditional IQA tasks. Furthermore, cross-dataset evaluation also shows that HarmonyIQA exhibits better generalization ability than self-supervised learning-based IQA methods. Both HarmonyIQAD and HarmonyIQA will be made publicly available upon paper publication.
△ Less
Submitted 2 January, 2025;
originally announced January 2025.
-
ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos
Authors:
Xilei Zhu,
Huiyu Duan,
Liu Yang,
Yucheng Zhu,
Xiongkuo Min,
Guangtao Zhai,
Patrick Le Callet
Abstract:
With the rapid development of eXtended Reality (XR), egocentric spatial shooting and display technologies have further enhanced immersion and engagement for users. Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensure a high-quality viewing experience. However, the corresponding research is still lacking. In this paper, we use the embodied experience to highli…
▽ More
With the rapid development of eXtended Reality (XR), egocentric spatial shooting and display technologies have further enhanced immersion and engagement for users. Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensure a high-quality viewing experience. However, the corresponding research is still lacking. In this paper, we use the embodied experience to highlight this more immersive experience and study the new problem, i.e., embodied perceptual quality assessment for egocentric spatial videos. Specifically, we introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos and their mean opinion scores (MOSs). Furthermore, we propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict the perceptual quality. Experimental results demonstrate the ESVQAnet outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task, and exhibits strong generalization capability on traditional VQA tasks. The database and codes will be released upon the publication.
△ Less
Submitted 29 December, 2024;
originally announced December 2024.
-
FineVQ: Fine-Grained User Generated Content Video Quality Assessment
Authors:
Huiyu Duan,
Qiang Hu,
Jiarui Wang,
Liu Yang,
Zitong Xu,
Lu Liu,
Xiongkuo Min,
Chunlei Cai,
Tianxiao Ye,
Xiaoyun Zhang,
Guangtao Zhai
Abstract:
The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To a…
▽ More
The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that our proposed FineVQ can produce fine-grained video-quality results and achieve state-of-the-art performance on FineVD and other commonly used UGC-VQA datasets.
△ Less
Submitted 26 April, 2025; v1 submitted 26 December, 2024;
originally announced December 2024.
-
Exemplar-condensed Federated Class-incremental Learning
Authors:
Rui Sun,
Yumin Zhang,
Varun Ojha,
Tejal Shah,
Haoran Duan,
Bo Wei,
Rajiv Ranjan
Abstract:
We propose Exemplar-Condensed federated class-incremental learning (ECoral) to distil the training characteristics of real images from streaming data into informative rehearsal exemplars. The proposed method eliminates the limitations of exemplar selection in replay-based approaches for mitigating catastrophic forgetting in federated continual learning (FCL). The limitations particularly related t…
▽ More
We propose Exemplar-Condensed federated class-incremental learning (ECoral) to distil the training characteristics of real images from streaming data into informative rehearsal exemplars. The proposed method eliminates the limitations of exemplar selection in replay-based approaches for mitigating catastrophic forgetting in federated continual learning (FCL). The limitations particularly related to the heterogeneity of information density of each summarized data. Our approach maintains the consistency of training gradients and the relationship to past tasks for the summarized exemplars to represent the streaming data compared to the original images effectively. Additionally, our approach reduces the information-level heterogeneity of the summarized data by inter-client sharing of the disentanglement generative model. Extensive experiments show that our ECoral outperforms several state-of-the-art methods and can be seamlessly integrated with many existing approaches to enhance performance.
△ Less
Submitted 25 December, 2024;
originally announced December 2024.
-
NoiseHGNN: Synthesized Similarity Graph-Based Neural Network For Noised Heterogeneous Graph Representation Learning
Authors:
Xiong Zhang,
Cheng Xie,
Haoran Duan,
Beibei Yu
Abstract:
Real-world graph data environments intrinsically exist noise (e.g., link and structure errors) that inevitably disturb the effectiveness of graph representation and downstream learning tasks. For homogeneous graphs, the latest works use original node features to synthesize a similarity graph that can correct the structure of the noised graph. This idea is based on the homogeneity assumption, which…
▽ More
Real-world graph data environments intrinsically exist noise (e.g., link and structure errors) that inevitably disturb the effectiveness of graph representation and downstream learning tasks. For homogeneous graphs, the latest works use original node features to synthesize a similarity graph that can correct the structure of the noised graph. This idea is based on the homogeneity assumption, which states that similar nodes in the homogeneous graph tend to have direct links in the original graph. However, similar nodes in heterogeneous graphs usually do not have direct links, which can not be used to correct the original noise graph. This causes a significant challenge in noised heterogeneous graph learning. To this end, this paper proposes a novel synthesized similarity-based graph neural network compatible with noised heterogeneous graph learning. First, we calculate the original feature similarities of all nodes to synthesize a similarity-based high-order graph. Second, we propose a similarity-aware encoder to embed original and synthesized graphs with shared parameters. Then, instead of graph-to-graph supervising, we synchronously supervise the original and synthesized graph embeddings to predict the same labels. Meanwhile, a target-based graph extracted from the synthesized graph contrasts the structure of the metapath-based graph extracted from the original graph to learn the mutual information. Extensive experiments in numerous real-world datasets show the proposed method achieves state-of-the-art records in the noised heterogeneous graph learning tasks. In highlights, +5$\sim$6\% improvements are observed in several noised datasets compared with previous SOTA methods. The code and datasets are available at https://github.com/kg-cc/NoiseHGNN.
△ Less
Submitted 24 December, 2024;
originally announced December 2024.
-
ErasableMask: A Robust and Erasable Privacy Protection Scheme against Black-box Face Recognition Models
Authors:
Sipeng Shen,
Yunming Zhang,
Dengpan Ye,
Xiuwen Shi,
Long Tang,
Haoran Duan,
Jiacheng Deng,
Ziyi Liu
Abstract:
While face recognition (FR) models have brought remarkable convenience in face verification and identification, they also pose substantial privacy risks to the public. Existing facial privacy protection schemes usually adopt adversarial examples to disrupt face verification of FR models. However, these schemes often suffer from weak transferability against black-box FR models and permanently damag…
▽ More
While face recognition (FR) models have brought remarkable convenience in face verification and identification, they also pose substantial privacy risks to the public. Existing facial privacy protection schemes usually adopt adversarial examples to disrupt face verification of FR models. However, these schemes often suffer from weak transferability against black-box FR models and permanently damage the identifiable information that cannot fulfill the requirements of authorized operations such as forensics and authentication. To address these limitations, we propose ErasableMask, a robust and erasable privacy protection scheme against black-box FR models. Specifically, via rethinking the inherent relationship between surrogate FR models, ErasableMask introduces a novel meta-auxiliary attack, which boosts black-box transferability by learning more general features in a stable and balancing optimization strategy. It also offers a perturbation erasion mechanism that supports the erasion of semantic perturbations in protected face without degrading image quality. To further improve performance, ErasableMask employs a curriculum learning strategy to mitigate optimization conflicts between adversarial attack and perturbation erasion. Extensive experiments on the CelebA-HQ and FFHQ datasets demonstrate that ErasableMask achieves the state-of-the-art performance in transferability, achieving over 72% confidence on average in commercial FR systems. Moreover, ErasableMask also exhibits outstanding perturbation erasion performance, achieving over 90% erasion success rate.
△ Less
Submitted 29 December, 2024; v1 submitted 22 December, 2024;
originally announced December 2024.
-
Revealing the Black Box of Device Search Engine: Scanning Assets, Strategies, and Ethical Consideration
Authors:
Mengying Wu,
Geng Hong,
Jinsong Chen,
Qi Liu,
Shujun Tang,
Youhao Li,
Baojun Liu,
Haixin Duan,
Min Yang
Abstract:
In the digital age, device search engines such as Censys and Shodan play crucial roles by scanning the internet to catalog online devices, aiding in the understanding and mitigation of network security risks. While previous research has used these tools to detect devices and assess vulnerabilities, there remains uncertainty regarding the assets they scan, the strategies they employ, and whether th…
▽ More
In the digital age, device search engines such as Censys and Shodan play crucial roles by scanning the internet to catalog online devices, aiding in the understanding and mitigation of network security risks. While previous research has used these tools to detect devices and assess vulnerabilities, there remains uncertainty regarding the assets they scan, the strategies they employ, and whether they adhere to ethical guidelines. This study presents the first comprehensive examination of these engines' operational and ethical dimensions. We developed a novel framework to trace the IP addresses utilized by these engines and collected 1,407 scanner IPs. By uncovering their IPs, we gain deep insights into the actions of device search engines for the first time and gain original findings. By employing 28 honeypots to monitor their scanning activities extensively in one year, we demonstrate that users can hardly evade scans by blocklisting scanner IPs or migrating service ports. Our findings reveal significant ethical concerns, including a lack of transparency, harmlessness, and anonymity. Notably, these engines often fail to provide transparency and do not allow users to opt out of scans. Further, the engines send malformed requests, attempt to access excessive details without authorization, and even publish personally identifiable information (PII) and screenshots on search results. These practices compromise user privacy and expose devices to further risks by potentially aiding malicious entities. This paper emphasizes the urgent need for stricter ethical standards and enhanced transparency in the operations of device search engines, offering crucial insights into safeguarding against invasive scanning practices and protecting digital infrastructures.
△ Less
Submitted 20 December, 2024;
originally announced December 2024.
-
User-Generated Content and Editors in Games: A Comprehensive Survey
Authors:
Yuyue Liu,
Haihan Duan,
Wei Cai
Abstract:
User-Generated Content (UGC) refers to any form of content, such as posts and images, created by users rather than by professionals. In recent years, UGC has become an essential part of the evolving video game industry, influencing both game culture and community dynamics. The ability for users to actively contribute to the games they engage with has shifted the landscape of gaming from a one-dire…
▽ More
User-Generated Content (UGC) refers to any form of content, such as posts and images, created by users rather than by professionals. In recent years, UGC has become an essential part of the evolving video game industry, influencing both game culture and community dynamics. The ability for users to actively contribute to the games they engage with has shifted the landscape of gaming from a one-directional entertainment experience into a collaborative, user-driven ecosystem. Therefore, this growing trend highlights the urgent need for summarizing the current UGC development in game industry. Our conference paper has systematically classified the existing UGC in games and the UGC editors separately into four types. However, the previous survey lacks the depth and precision necessary to capture the wide-ranging and increasingly complex nature of UGC. To this end, as an extension of previous work, this paper presents a refined and expanded classification of UGC and UGC editors within video games, offering a more robust and comprehensive framework with representative cases that better reflects the diversity and nuances of contemporary user-generated contributions. Moreover, we provide our insights on the future of UGC, involving game culture, game genre and user creative tendencies, artificial intelligence, its potential ethical considerations, and relationship between games, users and communities.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.