A Shared Standard for Valid Measurement of Generative AI Systems' Capabilities, Risks, and Impacts

Chouldechova, Alexandra; Atalla, Chad; Barocas, Solon; Cooper, A. Feder; Corvi, Emily; Dow, P. Alex; Garcia-Gathright, Jean; Pangakis, Nicholas; Reed, Stefanie; Sheng, Emily; Vann, Dan; Vogel, Matthew; Washington, Hannah; Wallach, Hanna

arXiv:2412.01934 (cs)

[Submitted on 2 Dec 2024]

Abstract: The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems. We introduce a shared standard for valid measurement that helps place many of the disparate-seeming evaluation practices in use today on a common footing. Our framework, grounded in measurement theory from the social sciences, extends the work of Adcock & Collier (2001), in which the authors formalized valid measurement of concepts in political science via three processes: systematizing background concepts, operationalizing systematized concepts via annotation procedures, and applying those procedures to instances. We argue that valid measurement of GenAI systems' capabilities, risks, and impacts further requires systematizing, operationalizing, and applying not only the entailed concepts, but also the contexts of interest and the metrics used. This involves both descriptive reasoning about particular instances and inferential reasoning about underlying populations, which is the purview of statistics. By placing many disparate-seeming GenAI evaluation practices on a common footing, our framework enables individual evaluations to be better understood, interrogated for reliability and validity, and meaningfully compared. This is an important step in advancing GenAI evaluation practices toward more formalized and theoretically grounded processes -- i.e., toward a science of GenAI evaluations.
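
The distinction the abstract draws between descriptive reasoning about particular instances and inferential reasoning about an underlying population can be illustrated with a minimal sketch. Nothing here comes from the paper itself: the keyword-based `annotate` rule, the four sample outputs, and the choice of a Wilson score interval are all assumptions made purely for illustration.

```python
import math

# Hypothetical annotation procedure (an assumption, not the paper's):
# label an output 1 if it contains any flagged term, else 0.
FLAGGED_TERMS = ("idiot", "stupid")

def annotate(output: str) -> int:
    text = output.lower()
    return int(any(term in text for term in FLAGGED_TERMS))

def descriptive_rate(labels: list[int]) -> float:
    """Descriptive claim: the fraction of flagged outputs in this sample."""
    return sum(labels) / len(labels)

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Inferential claim: approximate 95% Wilson score interval for the
    population rate, assuming the sample was drawn at random."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Made-up system outputs standing in for sampled instances.
sample_outputs = [
    "Here is the recipe you asked for.",
    "You are an idiot.",
    "The capital of France is Paris.",
    "I can help with that.",
]

# Apply the annotation procedure to each instance.
labels = [annotate(o) for o in sample_outputs]

rate = descriptive_rate(labels)
low, high = wilson_interval(sum(labels), len(labels))
print(f"sample rate: {rate:.2f}")
print(f"interval for population rate: ({low:.2f}, {high:.2f})")
```

The sample rate describes only the four annotated outputs, whereas the interval is a statement about the population those outputs were drawn from, and it is only as trustworthy as the sampling assumption behind it.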

Comments: NeurIPS 2024 Workshop on Statistical Foundations of LLMs and Foundation Models (SFLLM)
Subjects: Computers and Society (cs.CY)
Cite as: arXiv:2412.01934 [cs.CY] (or arXiv:2412.01934v1 [cs.CY] for this version)
DOI: https://doi.org/10.48550/arXiv.2412.01934

Submission history

From: Hanna Wallach
[v1] Mon, 2 Dec 2024 19:50:00 UTC (182 KB)
