-
YINYANG-ALIGN: Benchmarking Contradictory Objectives and Proposing Multi-Objective Optimization based DPO for Text-to-Image Alignment
Authors:
Amitava Das,
Yaswanth Narsupalli,
Gurpreet Singh,
Vinija Jain,
Vasu Sharma,
Suranjana Trivedy,
Aman Chadha,
Amit Sheth
Abstract:
Precise alignment in Text-to-Image (T2I) systems is crucial to ensure that generated visuals not only accurately encapsulate user intents but also conform to stringent ethical and aesthetic benchmarks. Incidents like the Google Gemini fiasco, where misaligned outputs triggered significant public backlash, underscore the critical need for robust alignment mechanisms. In contrast, Large Language Mod…
▽ More
Precise alignment in Text-to-Image (T2I) systems is crucial to ensure that generated visuals not only accurately encapsulate user intents but also conform to stringent ethical and aesthetic benchmarks. Incidents like the Google Gemini fiasco, where misaligned outputs triggered significant public backlash, underscore the critical need for robust alignment mechanisms. In contrast, Large Language Models (LLMs) have achieved notable success in alignment. Building on these advancements, researchers are eager to apply similar alignment techniques, such as Direct Preference Optimization (DPO), to T2I systems to enhance image generation fidelity and reliability.
We present YinYangAlign, an advanced benchmarking framework that systematically quantifies the alignment fidelity of T2I systems, addressing six fundamental and inherently contradictory design objectives. Each pair represents fundamental tensions in image generation, such as balancing adherence to user prompts with creative modifications or maintaining diversity alongside visual coherence. YinYangAlign includes detailed axiom datasets featuring human prompts, aligned (chosen) responses, misaligned (rejected) AI-generated outputs, and explanations of the underlying contradictions.
△ Less
Submitted 9 February, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
Authors:
Amitava Das,
Suranjana Trivedy,
Danush Khanna,
Rajarshi Roy,
Gurpreet Singh,
Basab Ghosh,
Yaswanth Narsupalli,
Vinija Jain,
Vasu Sharma,
Aishwarya Naresh Reganti,
Aman Chadha
Abstract:
The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key c…
▽ More
The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.
△ Less
Submitted 19 January, 2025; v1 submitted 4 January, 2025;
originally announced January 2025.
-
ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
Authors:
Yaswanth Narsupalli,
Abhranil Chandra,
Sreevatsa Muppirala,
Manish Gupta,
Pawan Goyal
Abstract:
Assessing the quality of outputs generated by generative models, such as large language models and vision language models, presents notable challenges. Traditional methods for evaluation typically rely on either human assessments, which are resource-intensive, or automatic metrics that often show a low correlation with human judgment. Another common approach is to use deep learning systems, which…
▽ More
Assessing the quality of outputs generated by generative models, such as large language models and vision language models, presents notable challenges. Traditional methods for evaluation typically rely on either human assessments, which are resource-intensive, or automatic metrics that often show a low correlation with human judgment. Another common approach is to use deep learning systems, which not only consume a substantial amount of compute and time but also require extensive training data. In this study, we introduce a tuning-free framework called ReFeR, designed to evaluate generative outputs, including both text and images, by leveraging a 2-level hierarchy of LLMs and VLMs themselves. We rigorously evaluate our framework, ReFeR, across four diverse evaluation tasks. The framework not only improves the accuracy of these evaluations, surpassing previous benchmarks but also generates constructive feedback. Interestingly, the framework is also applicable to reasoning tasks. Experiments on four reasoning tasks demonstrate superior collective reasoning abilities of the framework. We present two variants of the framework: ReFeR-Turbo, optimized for accelerated performance, and ReFeR-Lite, offering a more cost-effective solution. ReFeR-Lite is $\sim7.7\times$ more efficient while being comparably accurate to ReFeR-Turbo. We make code, data and PIP package publicly available. See this PIP URL https://pypi.org/project/refer-agents/ and this Git URL https://github.com/yaswanth-iitkgp/ReFeR_Code .
△ Less
Submitted 9 October, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Authors:
Xuan He,
Dongfu Jiang,
Ge Zhang,
Max Ku,
Achint Soni,
Sherman Siu,
Haonan Chen,
Abhranil Chandra,
Ziyan Jiang,
Aaran Arulraj,
Kai Wang,
Quy Duc Do,
Yuansheng Ni,
Bohan Lyu,
Yaswanth Narsupalli,
Rongqi Fan,
Zhiheng Lyu,
Yuchen Lin,
Wenhu Chen
Abstract:
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-prov…
▽ More
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result on other held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
△ Less
Submitted 14 October, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
-
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
Authors:
Ziqiang Liu,
Feiteng Fang,
Xi Feng,
Xinrun Du,
Chenhao Zhang,
Zekun Wang,
Yuelin Bai,
Qixuan Zhao,
Liyang Fan,
Chengguang Gan,
Hongquan Lin,
Jiaming Li,
Yuansheng Ni,
Haihong Wu,
Yaswanth Narsupalli,
Zhigang Zheng,
Chengming Li,
Xiping Hu,
Ruifeng Xu,
Xiaojun Chen,
Min Yang,
Jiaheng Liu,
Ruibo Liu,
Wenhao Huang,
Ge Zhang
, et al. (1 additional authors not shown)
Abstract:
The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap,…
▽ More
The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.
△ Less
Submitted 13 January, 2025; v1 submitted 9 June, 2024;
originally announced June 2024.
-
DepNeCTI: Dependency-based Nested Compound Type Identification for Sanskrit
Authors:
Jivnesh Sandhan,
Yaswanth Narsupalli,
Sreevatsa Muppirala,
Sriram Krishnan,
Pavankumar Satuluri,
Amba Kulkarni,
Pawan Goyal
Abstract:
Multi-component compounding is a prevalent phenomenon in Sanskrit, and understanding the implicit structure of a compound's components is crucial for deciphering its meaning. Earlier approaches in Sanskrit have focused on binary compounds and neglected the multi-component compound setting. This work introduces the novel task of nested compound type identification (NeCTI), which aims to identify ne…
▽ More
Multi-component compounding is a prevalent phenomenon in Sanskrit, and understanding the implicit structure of a compound's components is crucial for deciphering its meaning. Earlier approaches in Sanskrit have focused on binary compounds and neglected the multi-component compound setting. This work introduces the novel task of nested compound type identification (NeCTI), which aims to identify nested spans of a multi-component compound and decode the implicit semantic relations between them. To the best of our knowledge, this is the first attempt in the field of lexical semantics to propose this task.
We present 2 newly annotated datasets including an out-of-domain dataset for this task. We also benchmark these datasets by exploring the efficacy of the standard problem formulations such as nested named entity recognition, constituency parsing and seq2seq, etc. We present a novel framework named DepNeCTI: Dependency-based Nested Compound Type Identifier that surpasses the performance of the best baseline with an average absolute improvement of 13.1 points F1-score in terms of Labeled Span Score (LSS) and a 5-fold enhancement in inference efficiency. In line with the previous findings in the binary Sanskrit compound identification task, context provides benefits for the NeCTI task. The codebase and datasets are publicly available at: https://github.com/yaswanth-iitkgp/DepNeCTI
△ Less
Submitted 14 October, 2023;
originally announced October 2023.