Certifying Counterfactual Bias in LLMs

Chaudhary, Isha; Hu, Qian; Kumar, Manoj; Ziyadi, Morteza; Gupta, Rahul; Singh, Gagandeep

arXiv:2405.18780 (cs)

[Submitted on 29 May 2024 (v1), last revised 21 Apr 2025 (this version, v3)]

Abstract: Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large numbers of inputs and do not provide guarantees. Therefore, we propose the first framework, LLMCert-B, that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts - prompts differing by demographic groups, sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes sampled from prefix distributions to a given set of prompts. We consider prefix distributions consisting of random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in the LLM's embedding space. We generate non-trivial certificates for SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.
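
To make the idea of "high-confidence bounds on the probability of unbiased responses" concrete, here is a minimal sketch. It is not the authors' method: the abstract does not say which statistical bound LLMCert-B uses, so this sketch substitutes a plain two-sided Hoeffding interval and assumes independent samples. The names `hoeffding_bounds`, `certify`, `sample_prompt_set`, and `is_unbiased` are hypothetical placeholders for a prompt-set sampler and a bias detector.

```python
import math
import random


def hoeffding_bounds(num_unbiased, num_samples, confidence=0.95):
    """Two-sided Hoeffding interval for a Bernoulli success probability.

    Assumption: the samples are i.i.d. This is a stand-in bound, not
    necessarily the one used by LLMCert-B.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    delta = 1.0 - confidence
    p_hat = num_unbiased / num_samples
    # Hoeffding: P(|p_hat - p| >= eps) <= 2 * exp(-2 * n * eps^2)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * num_samples))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)


def certify(sample_prompt_set, is_unbiased, num_samples=200,
            confidence=0.95, seed=0):
    """Estimate bounds on the probability of an unbiased response set.

    sample_prompt_set(rng) -> hypothetical sampler that draws one set of
        counterfactual prompts (e.g. a random prefix applied to each prompt).
    is_unbiased(prompt_set) -> hypothetical judge that queries the model and
        returns True when the responses show no bias across the set.
    """
    rng = random.Random(seed)
    num_unbiased = sum(
        1 for _ in range(num_samples) if is_unbiased(sample_prompt_set(rng))
    )
    return hoeffding_bounds(num_unbiased, num_samples, confidence)
```

In this sketch, a lower bound well below 1 over an inexpensive prefix distribution would indicate that biased responses are not rare under that distribution. Tighter intervals are available from exact binomial methods; Hoeffding is used here only because it needs nothing beyond the standard library.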

Comments:	Published at ICLR 2025
Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2405.18780 [cs.AI]
	(or arXiv:2405.18780v3 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2405.18780

Submission history

From: Isha Chaudhary
[v1] Wed, 29 May 2024 05:39:37 UTC (9,302 KB)
[v2] Sun, 20 Oct 2024 18:10:31 UTC (11,034 KB)
[v3] Mon, 21 Apr 2025 23:20:28 UTC (11,398 KB)
