-
Automated Capability Evaluation of Foundation Models
Authors:
Arash Afkanpour,
Omkar Dige,
Fatemeh Tavakoli
Abstract:
Current evaluation frameworks for foundation models rely heavily on fixed, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful language…
▽ More
Current evaluation frameworks for foundation models rely heavily on fixed, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful language models to decompose a domain into semantically meaningful capabilities and generate diverse evaluation tasks, significantly reducing human effort. To maximize coverage and efficiency, ACE models a subject model's performance as a capability function over a latent semantic space and uses active learning to prioritize the evaluation of the most informative capabilities. This adaptive evaluation strategy enables cost-effective discovery of strengths, weaknesses, and failure modes that static benchmarks may miss. Our results suggest that ACE provides a more complete and informative picture of model capabilities, which is essential for safe and well-informed deployment of foundation models.
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
Mitigating Social Biases in Language Models through Unlearning
Authors:
Omkar Dige,
Diljot Singh,
Tsz Fung Yau,
Qixuan Zhang,
Borna Bolandraftar,
Xiaodan Zhu,
Faiza Khan Khattak
Abstract:
Mitigating bias in language models (LMs) has become a critical problem due to the widespread deployment of LMs. Numerous approaches revolve around data pre-processing and fine-tuning of language models, tasks that can be both time-consuming and computationally demanding. Consequently, there is a growing interest in machine unlearning techniques given their capacity to induce the forgetting of unde…
▽ More
Mitigating bias in language models (LMs) has become a critical problem due to the widespread deployment of LMs. Numerous approaches revolve around data pre-processing and fine-tuning of language models, tasks that can be both time-consuming and computationally demanding. Consequently, there is a growing interest in machine unlearning techniques given their capacity to induce the forgetting of undesired behaviors of the existing pre-trained or fine-tuned models with lower computational cost. In this work, we explore two unlearning methods, (1) Partitioned Contrastive Gradient Unlearning (PCGU) applied on decoder models and (2) Negation via Task Vector, to reduce social biases in state-of-the-art and open-source LMs such as LLaMA-2 and OPT. We also implement distributed PCGU for large models. It is empirically shown, through quantitative and qualitative analyses, that negation via Task Vector method outperforms PCGU in debiasing with minimum deterioration in performance and perplexity of the models. On LLaMA-27B, negation via Task Vector reduces the bias score by 11.8%
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
On The Role of Reasoning in the Identification of Subtle Stereotypes in Natural Language
Authors:
Jacob-Junqi Tian,
Omkar Dige,
D. B. Emerson,
Faiza Khan Khattak
Abstract:
Large language models (LLMs) are trained on vast, uncurated datasets that contain various forms of biases and language reinforcing harmful stereotypes that may be subsequently inherited by the models themselves. Therefore, it is essential to examine and address biases in language models, integrating fairness into their development to ensure that these models do not perpetuate social biases. In thi…
▽ More
Large language models (LLMs) are trained on vast, uncurated datasets that contain various forms of biases and language reinforcing harmful stereotypes that may be subsequently inherited by the models themselves. Therefore, it is essential to examine and address biases in language models, integrating fairness into their development to ensure that these models do not perpetuate social biases. In this work, we demonstrate the importance of reasoning in zero-shot stereotype identification across several open-source LLMs. Accurate identification of stereotypical language is a complex task requiring a nuanced understanding of social structures, biases, and existing unfair generalizations about particular groups. While improved accuracy is observed through model scaling, the use of reasoning, especially multi-step reasoning, is crucial to consistent performance. Additionally, through a qualitative analysis of select reasoning traces, we highlight how reasoning improves not just accuracy, but also the interpretability of model decisions. This work firmly establishes reasoning as a critical component in automatic stereotype detection and is a first step towards stronger stereotype mitigation pipelines for LLMs.
△ Less
Submitted 28 September, 2024; v1 submitted 24 July, 2023;
originally announced August 2023.
-
Can Instruction Fine-Tuned Language Models Identify Social Bias through Prompting?
Authors:
Omkar Dige,
Jacob-Junqi Tian,
David Emerson,
Faiza Khan Khattak
Abstract:
As the breadth and depth of language model applications continue to expand rapidly, it is increasingly important to build efficient frameworks for measuring and mitigating the learned or inherited social biases of these models. In this paper, we present our work on evaluating instruction fine-tuned language models' ability to identify bias through zero-shot prompting, including Chain-of-Thought (C…
▽ More
As the breadth and depth of language model applications continue to expand rapidly, it is increasingly important to build efficient frameworks for measuring and mitigating the learned or inherited social biases of these models. In this paper, we present our work on evaluating instruction fine-tuned language models' ability to identify bias through zero-shot prompting, including Chain-of-Thought (CoT) prompts. Across LLaMA and its two instruction fine-tuned versions, Alpaca 7B performs best on the bias identification task with an accuracy of 56.7%. We also demonstrate that scaling up LLM size and data diversity could lead to further performance gain. This is a work-in-progress presenting the first component of our bias mitigation framework. We will keep updating this work as we get more results.
△ Less
Submitted 19 July, 2023;
originally announced July 2023.