Judge Anything: MLLM as a Judge Across Any Modality

Pu, Shu; Wang, Yaochen; Chen, Dongping; Chen, Yuhang; Wang, Guohao; Qin, Qi; Zhang, Zhongyi; Zhang, Zhiyuan; Zhou, Zetong; Gong, Shuang; Gui, Yi; Wan, Yao; Yu, Philip S.

Abstract:Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities to a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: this https URL.

Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.17489 [cs.CL]
	(or arXiv:2503.17489v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.17489

Computer Science > Computation and Language

Title:Judge Anything: MLLM as a Judge Across Any Modality

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators