Data Readiness for Scientific AI at Scale

Brewer, Wesley; Widener, Patrick; Anantharaj, Valentine; Wang, Feiyi; Beck, Tom; Shankar, Arjun; Oral, Sarp

doi:10.1145/3750720.3757282

arXiv:2507.23018 (cs)

[Submitted on 30 Jul 2025]

Abstract: This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, bio/health, and materials) to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, with an emphasis on transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.

Comments:	10 pages, 1 figure, 2 tables
Subjects:	Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
ACM classes:	I.2.6
Cite as:	arXiv:2507.23018 [cs.AI]
	(or arXiv:2507.23018v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2507.23018
Related DOI:	https://doi.org/10.1145/3750720.3757282

Submission history

From: Wesley Brewer
[v1] Wed, 30 Jul 2025 18:30:37 UTC (151 KB)
