-
Empowering Scientific Workflows with Federated Agents
Authors:
J. Gregory Pauloski,
Yadu Babuji,
Ryan Chard,
Mansi Sakarvadia,
Kyle Chard,
Ian Foster
Abstract:
Agentic systems, in which diverse agents cooperate to tackle challenging problems, are exploding in popularity in the AI community. However, the agentic frameworks used to build these systems have not previously enabled use with research cyberinfrastructure. Here we introduce Academy, a modular and extensible middleware designed to deploy autonomous agents across the federated research ecosystem, including HPC systems, experimental facilities, and data repositories. To meet the demands of scientific computing, Academy supports asynchronous execution, heterogeneous resources, high-throughput data flows, and dynamic resource availability. It provides abstractions for expressing stateful agents, managing inter-agent coordination, and integrating computation with experimental control. We present microbenchmark results that demonstrate high performance and scalability in HPC environments. To demonstrate the breadth of applications that can be supported by agentic workflow designs, we also present case studies in materials discovery, decentralized learning, and information extraction in which agents are deployed across diverse HPC systems.
Submitted 8 May, 2025;
originally announced May 2025.
-
HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights
Authors:
Ozan Gokdemir,
Carlo Siebenschuh,
Alexander Brace,
Azton Wells,
Brian Hsu,
Kyle Hippe,
Priyanka V. Setty,
Aswathy Ajith,
J. Gregory Pauloski,
Varuni Sastry,
Sam Foreman,
Huihuo Zheng,
Heng Ma,
Bharat Kale,
Nicholas Chia,
Thomas Gibbs,
Michael E. Papka,
Thomas Brettin,
Francis J. Alexander,
Anima Anandkumar,
Ian Foster,
Rick Stevens,
Venkatram Vishwanath,
Arvind Ramanathan
Abstract:
The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
Submitted 7 May, 2025;
originally announced May 2025.
-
34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery
Authors:
Yoel Zimmermann,
Adib Bazgir,
Alexander Al-Feghali,
Mehrad Ansari,
L. Catherine Brinson,
Yuan Chiang,
Defne Circi,
Min-Hsueh Chiu,
Nathan Daelman,
Matthew L. Evans,
Abhijeet S. Gangan,
Janine George,
Hassan Harb,
Ghazal Khalighinejad,
Sartaaj Takrim Khan,
Sascha Klawohn,
Magdalena Lederbauer,
Soroush Mahjoubi,
Bernadette Mohr,
Seyed Mohamad Moosavi,
Aakash Naik,
Aleyna Beste Ozhan,
Dieter Plessers,
Aritra Roy,
Fabian Schöppach
, et al. (8 additional authors not shown)
Abstract:
Large Language Models (LLMs) are reshaping many aspects of materials science and chemistry research, enabling advances in molecular property prediction, materials design, scientific automation, knowledge extraction, and more. Recent developments demonstrate that the latest class of models are able to integrate structured and unstructured data, assist in hypothesis generation, and streamline research workflows. To explore the frontier of LLM capabilities across the research lifecycle, we review applications of LLMs through 34 total projects developed during the second annual Large Language Model Hackathon for Applications in Materials Science and Chemistry, a global hybrid event. These projects spanned seven key research areas: (1) molecular and material property prediction, (2) molecular and material design, (3) automation and novel interfaces, (4) scientific communication and education, (5) research data management and automation, (6) hypothesis generation and evaluation, and (7) knowledge extraction and reasoning from the scientific literature. Collectively, these applications illustrate how LLMs serve as versatile predictive models, platforms for rapid prototyping of domain-specific tools, and much more. In particular, improvements in both open source and proprietary LLM performance through the addition of reasoning, additional training data, and new techniques have expanded effectiveness, particularly in low-data environments and interdisciplinary research. As LLMs continue to improve, their integration into scientific workflows presents both new opportunities and new challenges, requiring ongoing exploration, continued refinement, and further research to address reliability, interpretability, and reproducibility.
Submitted 5 May, 2025;
originally announced May 2025.
-
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Authors:
Carlo Siebenschuh,
Kyle Hippe,
Ozan Gokdemir,
Alexander Brace,
Arham Khan,
Khalid Hossain,
Yadu Babuji,
Nicholas Chia,
Venkatram Vishwanath,
Rick Stevens,
Arvind Ramanathan,
Ian Foster,
Robert Underwood
Abstract:
Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/
Submitted 23 April, 2025;
originally announced May 2025.
-
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Authors:
Bang Liu,
Xinfeng Li,
Jiayi Zhang,
Jinlin Wang,
Tanjin He,
Sirui Hong,
Hongzhang Liu,
Shaokun Zhang,
Kaitao Song,
Kunlun Zhu,
Yuheng Cheng,
Suyuchen Wang,
Xiaoqiang Wang,
Yuyu Luo,
Haibo Jin,
Peiyan Zhang,
Ollie Liu,
Jiaqi Chen,
Huan Zhang,
Zhaoyang Yu,
Haochen Shi,
Boyan Li,
Dekun Wu,
Fengwei Teng,
Xiaojun Jia
, et al. (22 additional authors not shown)
Abstract:
The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This survey provides a comprehensive overview, framing intelligent agents within a modular, brain-inspired architecture that integrates principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we delve into the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities, and elucidating core components such as memory, world modeling, reward processing, and emotion-like systems. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms, including emerging AutoML and LLM-driven optimization strategies. Third, we examine collaborative and evolutionary multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures, highlighting parallels to human social dynamics. Finally, we address the critical imperative of building safe, secure, and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment.
Submitted 31 March, 2025;
originally announced April 2025.
-
Globus Service Enhancements for Exascale Applications and Facilities
Authors:
Weijian Zheng,
Jack Kordas,
Tyler J. Skluzacek,
Raj Kettimuthu,
Ian Foster
Abstract:
Many extreme-scale applications require the movement of large quantities of data to, from, and among leadership computing facilities, as well as other scientific facilities and the home institutions of facility users. These applications, particularly when leadership computing facilities are involved, can touch upon edge cases (e.g., terabyte files) that had not been a focus of previous Globus optimization work, which had emphasized rather the movement of many smaller (megabyte to gigabyte) files. We report here on how automated client-driven chunking can be used to accelerate both the movement of large files and the integrity checking operations that have proven to be essential for large data transfers. We present detailed performance studies that provide insights into the benefits of these modifications in a range of file transfer scenarios.
Submitted 29 March, 2025;
originally announced March 2025.
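
As a rough illustration of the chunking idea in the Globus abstract above, the sketch below splits a large payload into fixed-size chunks and computes a checksum per chunk in parallel. It is not the Globus implementation: the function names (`chunk_ranges`, `chunked_digests`), the choice of SHA-256, and the thread-pool approach are all assumptions made for illustration only.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(size, chunk_size):
    """Split [0, size) into (offset, length) pairs of at most chunk_size bytes."""
    return [(off, min(chunk_size, size - off)) for off in range(0, size, chunk_size)]


def chunk_digest(data, offset, length):
    """Checksum a single chunk (SHA-256 is an assumption, not Globus's choice)."""
    return hashlib.sha256(data[offset:offset + length]).hexdigest()


def chunked_digests(data, chunk_size, workers=4):
    """Checksum every chunk concurrently; results are returned in chunk order."""
    ranges = chunk_ranges(len(data), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: chunk_digest(data, *r), ranges))


if __name__ == "__main__":
    payload = b"x" * 10_000
    for (offset, length), digest in zip(chunk_ranges(len(payload), 4096),
                                        chunked_digests(payload, 4096)):
        print(offset, length, digest[:16])
```

Because each chunk is verified independently, a mismatch identifies which byte range would need to be resent rather than invalidating the whole file.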
-
WRATH: Workload Resilience Across Task Hierarchies in Task-based Parallel Programming Frameworks
Authors:
Sicheng Zhou,
Zhuozhao Li,
Valérie Hayot-Sasson,
Haochen Pan,
Maxime Gonthier,
J. Gregory Pauloski,
Ryan Chard,
Kyle Chard,
Ian Foster
Abstract:
Failures in Task-based Parallel Programming (TBPP) can severely degrade performance and result in incomplete or incorrect outcomes. Existing failure-handling approaches, including reactive, proactive, and resilient methods such as retry and checkpointing mechanisms, often apply uniform retry mechanisms regardless of the root cause of failures, failing to account for the unique characteristics of TBPP frameworks such as heterogeneous resource availability and task-level failures. To address these limitations, we propose WRATH, a novel systematic approach that categorizes failures based on the unique layered structure of TBPP frameworks and defines specific responses to address failures at different layers. WRATH combines a distributed monitoring system and a resilient module to collaboratively address different types of failures in real time. The monitoring system captures execution and resource information, reports failures, and profiles tasks across different layers of TBPP frameworks. The resilient module then categorizes failures and responds with appropriate actions, such as hierarchically retrying failed tasks on suitable resources. Evaluations demonstrate that WRATH significantly improves TBPP robustness, tripling the task success rate and maintaining an application success rate of over 90% for resolvable failures. Additionally, WRATH can reduce the time to failure by 20%-50%, allowing tasks that are destined to fail to be identified and fail more quickly.
Submitted 27 March, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants
Authors:
Franck Cappello,
Sandeep Madireddy,
Robert Underwood,
Neil Getty,
Nicholas Lee-Ping Chia,
Nesar Ramachandra,
Josh Nguyen,
Murat Keceli,
Tanwi Mallick,
Zilinghan Li,
Marieme Ngom,
Chenhui Zhang,
Angel Yanguas-Gil,
Evan Antoniuk,
Bhavya Kailkhura,
Minyang Tian,
Yufeng Du,
Yuan-Sen Ting,
Azton Wells,
Bogdan Nicolae,
Avinash Maurya,
M. Mustafa Rafique,
Eliu Huerta,
Bo Li,
Ian Foster
, et al. (1 additional author not shown)
Abstract:
Recent advancements have positioned AI, and particularly Large Language Models (LLMs), as transformative tools for scientific research, capable of addressing complex tasks that require reasoning, problem-solving, and decision-making. Their exceptional capabilities suggest their potential as scientific research assistants but also highlight the need for holistic, rigorous, and domain-specific evaluation to assess effectiveness in real-world scientific applications. This paper describes a multifaceted methodology for Evaluating AI models as scientific Research Assistants (EAIRA) developed at Argonne National Laboratory. This methodology incorporates four primary classes of evaluations: 1) Multiple Choice Questions to assess factual recall; 2) Open Response to evaluate advanced reasoning and problem-solving skills; 3) Lab-Style Experiments involving detailed analysis of capabilities as research assistants in controlled environments; and 4) Field-Style Experiments to capture researcher-LLM interactions at scale in a wide range of scientific domains and applications. These complementary methods enable a comprehensive analysis of LLM strengths and weaknesses with respect to their scientific knowledge, reasoning abilities, and adaptability. Recognizing the rapid pace of LLM advancements, we designed the methodology to evolve and adapt so as to ensure its continued relevance and applicability. This paper describes the state of the methodology at the end of February 2025. Although developed within a subset of scientific domains, the methodology is designed to be generalizable to a wide range of scientific domains.
Submitted 27 February, 2025;
originally announced February 2025.
-
Connecting Large Language Model Agent to High Performance Computing Resource
Authors:
Heng Ma,
Alexander Brace,
Carlo Siebenschuh,
Greg Pauloski,
Ian Foster,
Arvind Ramanathan
Abstract:
The Large Language Model (LLM) agent workflow enables the LLM to invoke tool functions to improve performance on domain-specific scientific questions. Tackling scientific research at large scale requires access to computing resources and a parallel computing setup. In this work, we integrated Parsl into the LangChain/LangGraph tool call setup to bridge the gap between the LLM agent and computing resources. Two tool call implementations were set up and tested on both a local workstation and an HPC environment on Polaris/ALCF. The first implementation, a Parsl-enabled LangChain tool node, queues the tool functions concurrently to the Parsl workers for parallel execution. The second is implemented by converting the tool functions into Parsl ensemble functions and is more suitable for large tasks in a supercomputer environment. The LLM agent workflow was prompted to run molecular dynamics simulations with different protein structures and simulation conditions. The results showed that the LLM agent tools were managed and executed concurrently by Parsl on the available computing resources.
Submitted 17 February, 2025;
originally announced February 2025.
-
DrugImproverGPT: A Large Language Model for Drug Optimization with Fine-Tuning via Structured Policy Optimization
Authors:
Xuefeng Liu,
Songhao Jiang,
Siyu Chen,
Zhuoran Yang,
Yuxin Chen,
Ian Foster,
Rick Stevens
Abstract:
Finetuning a Large Language Model (LLM) is crucial for generating results towards specific objectives. This research delves into the realm of drug optimization and introduces a novel reinforcement learning algorithm to finetune a drug optimization LLM-based generative model, enhancing the original drug across target objectives while retaining the beneficial chemical properties of the original drug. This work comprises two primary components: (1) DrugImprover: A framework tailored for improving robustness and efficiency in drug optimization. It includes an LLM designed for drug optimization and a novel Structured Policy Optimization (SPO) algorithm, which is theoretically grounded. This algorithm offers a unique perspective for fine-tuning the LLM-based generative model by aligning the improvement of the generated molecule with the input molecule under desired objectives. (2) A dataset of 1 million compounds, each with OEDOCK docking scores on 5 human proteins associated with cancer cells and 24 binding sites from SARS-CoV-2 virus. We conduct a comprehensive evaluation of SPO and demonstrate its effectiveness in improving the original drug across target properties. Our code and dataset will be publicly available at: https://github.com/xuefeng-cs/DrugImproverGPT.
Submitted 10 February, 2025;
originally announced February 2025.
-
ScaffoldGPT: A Scaffold-based GPT Model for Drug Optimization
Authors:
Xuefeng Liu,
Songhao Jiang,
Ian Foster,
Jinbo Xu,
Rick Stevens
Abstract:
Drug optimization has become increasingly crucial in light of fast-mutating virus strains and drug-resistant cancer cells. Nevertheless, it remains challenging as it necessitates retaining the beneficial properties of the original drug while simultaneously enhancing desired attributes beyond its scope. In this work, we aim to tackle this challenge by introducing ScaffoldGPT, a novel Generative Pretrained Transformer (GPT) designed for drug optimization based on molecular scaffolds. Our work comprises three key components: (1) A three-stage drug optimization approach that integrates pretraining, finetuning, and decoding optimization. (2) A uniquely designed two-phase incremental training approach for pretraining the drug optimization GPT on molecular scaffolds with enhanced performance. (3) A token-level decoding optimization strategy, TOP-N, that enables controlled, reward-guided generation using a pretrained/finetuned GPT. We demonstrate via a comprehensive evaluation on COVID and cancer benchmarks that ScaffoldGPT outperforms the competing baselines in drug optimization benchmarks, while excelling in preserving the original functional scaffold and enhancing desired properties.
Submitted 11 April, 2025; v1 submitted 9 February, 2025;
originally announced February 2025.
-
Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems
Authors:
Wenyi Wang,
Maxime Gonthier,
Poornima Nookala,
Haochen Pan,
Ian Foster,
Ioan Raicu,
Kyle Chard
Abstract:
Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU's priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU's centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using Barcelona OpenMP Task Suite (BOTS) benchmarks. We show that the use of XQueue and the distributed tree barrier can improve performance by up to 1522.8$\times$ compared to the original GNU OpenMP. We further show that lock-less load balancing can improve performance by up to 4$\times$ compared to GNU OpenMP using XQueue.
Submitted 19 March, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
MOFA: Discovering Materials for Carbon Capture with a GenAI- and Simulation-Based Workflow
Authors:
Xiaoli Yan,
Nathaniel Hudson,
Hyun Park,
Daniel Grzenda,
J. Gregory Pauloski,
Marcus Schwarting,
Haochen Pan,
Hassan Harb,
Samuel Foreman,
Chris Knight,
Tom Gibbs,
Kyle Chard,
Santanu Chaudhuri,
Emad Tajkhorshid,
Ian Foster,
Mohamad Moosavi,
Logan Ward,
E. A. Huerta
Abstract:
We present MOFA, an open-source generative AI (GenAI) plus simulation workflow for high-throughput generation of metal-organic frameworks (MOFs) on large-scale high-performance computing (HPC) systems. MOFA addresses key challenges in integrating GPU-accelerated computing for GPU-intensive GenAI tasks, including distributed training and inference, alongside CPU- and GPU-optimized tasks for screening and filtering AI-generated MOFs using molecular dynamics, density functional theory, and Monte Carlo simulations. These heterogeneous tasks are unified within an online learning framework that optimizes the utilization of available CPU and GPU resources across HPC systems. Performance metrics from a 450-node (14,400 AMD Zen 3 CPUs + 1800 NVIDIA A100 GPUs) supercomputer run demonstrate that MOFA achieves high-throughput generation of novel MOF structures, with CO$_2$ adsorption capacities ranking among the top 10 in the hypothetical MOF (hMOF) dataset. Furthermore, the production of high-quality MOFs exhibits a linear relationship with the number of nodes utilized. The modular architecture of MOFA will facilitate its integration into other scientific applications that dynamically combine GenAI with large-scale simulations.
Submitted 17 January, 2025;
originally announced January 2025.
-
Core Hours and Carbon Credits: Incentivizing Sustainability in HPC
Authors:
Alok Kamatar,
Maxime Gonthier,
Valerie Hayot-Sasson,
Andre Bauer,
Marcin Copik,
Torsten Hoefler,
Raul Castro Fernandez,
Kyle Chard,
Ian Foster
Abstract:
Realizing a shared responsibility between providers and consumers is critical to manage the sustainability of HPC. However, while cost may motivate efficiency improvements by infrastructure operators, broader progress is impeded by a lack of user incentives. We conduct a survey of HPC users that reveals fewer than 30 percent are aware of their energy consumption, and that energy efficiency is among users' lowest priority concerns. One explanation is that existing pricing models may encourage users to prioritize performance over energy efficiency. We propose two transparent multi-resource pricing schemes, Energy- and Carbon-Based Accounting, that seek to change this paradigm by incentivizing more efficient user behavior. These two schemes charge for computations based on their energy consumption or carbon footprint, respectively, rewarding users who leverage efficient hardware and software. We evaluate these two pricing schemes via simulation, in a prototype, and a user study.
Submitted 16 January, 2025;
originally announced January 2025.
-
Computational Grids
Authors:
Ian Foster,
Carl Kesselman
Abstract:
In this introductory chapter, we lay the groundwork for the rest of the book by providing a more detailed picture of the expected purpose, shape, and architecture of future grid systems. We structure the chapter in terms of six questions that we believe are central to this discussion: Why do we need computational grids? What types of applications will grids be used for? Who will use grids? How will grids be used? What is involved in building a grid? And, what problems must be solved to make grids commonplace? We provide an overview of each of these issues here, referring to subsequent chapters for more detailed discussion.
Submitted 2 January, 2025;
originally announced January 2025.
-
Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry
Authors:
Yoel Zimmermann,
Adib Bazgir,
Zartashia Afzal,
Fariha Agbere,
Qianxiang Ai,
Nawaf Alampara,
Alexander Al-Feghali,
Mehrad Ansari,
Dmytro Antypov,
Amro Aswad,
Jiaru Bai,
Viktoriia Baibakova,
Devi Dutta Biswajeet,
Erik Bitzek,
Joshua D. Bocarsly,
Anna Borisova,
Andres M Bran,
L. Catherine Brinson,
Marcel Moran Calderon,
Alessandro Canalicchio,
Victor Chen,
Yuan Chiang,
Defne Circi,
Benjamin Charmes,
Vikrant Chaudhary
, et al. (119 additional authors not shown)
Abstract:
Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapidly prototyping custom applications in scientific research.
Submitted 2 January, 2025; v1 submitted 20 November, 2024;
originally announced November 2024.
-
LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
Authors:
Arham Khan,
Robert Underwood,
Carlo Siebenschuh,
Yadu Babuji,
Aswathy Ajith,
Kyle Hippe,
Ozan Gokdemir,
Alexander Brace,
Kyle Chard,
Ian Foster
Abstract:
Deduplication -- detecting and eliminating additional instances of the same content in large collections of technical documents -- is a major focus when assembling and curating training datasets for large language models (LLMs). Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same deduplication performance as MinhashLSH with only a marginal increase in false positives (as low as 1e-5 in our experiments); demonstrates competitive runtime (270% faster than MinhashLSH on peS2o); and, crucially, uses just 0.6% of the disk space required by MinhashLSH to deduplicate peS2o. We demonstrate that this space advantage scales with increased dataset size -- at the extreme scale of several billion documents, LSHBloom promises a 250% speedup and a 54$\times$ space advantage over traditional MinhashLSH, scaling deduplication of text datasets to many billions of documents.
Submitted 12 May, 2025; v1 submitted 6 November, 2024;
originally announced November 2024.
-
Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows
Authors:
Rafael Ferreira da Silva,
Deborah Bard,
Kyle Chard,
Shaun de Witt,
Ian T. Foster,
Tom Gibbs,
Carole Goble,
William Godoy,
Johan Gustafsson,
Utz-Uwe Haus,
Stephen Hudson,
Shantenu Jha,
Laila Los,
Drew Paine,
Frédéric Suter,
Logan Ward,
Sean Wilkinson,
Marcos Amaris,
Yadu Babuji,
Jonathan Bader,
Riccardo Balin,
Daniel Balouek,
Sarah Beecroft,
Khalid Belhajjame,
Rajat Bhattarai,
et al. (86 additional authors not shown)
Abstract:
The Workflows Community Summit gathered 111 participants from 18 countries to discuss emerging trends and challenges in scientific workflows, focusing on six key areas: time-sensitive workflows, AI-HPC convergence, multi-facility workflows, heterogeneous HPC environments, user experience, and FAIR computational workflows. The integration of AI and exascale computing has revolutionized scientific workflows, enabling higher-fidelity models and complex, time-sensitive processes, while introducing challenges in managing heterogeneous environments and multi-facility data dependencies. The rise of large language models is driving computational demands to zettaflop scales, necessitating modular, adaptable systems and cloud-service models to optimize resource utilization and ensure reproducibility. Multi-facility workflows present challenges in data movement, curation, and overcoming institutional silos, while diverse hardware architectures require integrating workflow considerations into early system design and developing standardized resource management tools. The summit emphasized improving user experience in workflow systems and ensuring FAIR workflows to enhance collaboration and accelerate scientific discovery. Key recommendations include developing standardized metrics for time-sensitive workflows, creating frameworks for cloud-HPC integration, implementing distributed-by-design workflow modeling, establishing multi-facility authentication protocols, and accelerating AI integration in HPC workflow management. The summit also called for comprehensive workflow benchmarks, workflow-specific UX principles, and a FAIR workflow maturity model, highlighting the need for continued collaboration in addressing the complex challenges posed by the convergence of AI, HPC, and multi-facility research environments.
Submitted 18 October, 2024;
originally announced October 2024.
-
Deep Model Merging: The Sister of Neural Network Interpretability -- A Survey
Authors:
Arham Khan,
Todd Nief,
Nathaniel Hudson,
Mansi Sakarvadia,
Daniel Grzenda,
Aswathy Ajith,
Jordan Pettyjohn,
Kyle Chard,
Ian Foster
Abstract:
We survey the model merging literature through the lens of loss landscape geometry to connect observations from empirical studies on model merging and loss landscape analysis to phenomena that govern neural network training and the emergence of their inner representations. We distill repeated empirical observations from the literature in these fields into descriptions of four major characteristics of loss landscape geometry: mode convexity, determinism, directedness, and connectivity. We argue that insights into the structure of learned representations from model merging have applications to model interpretability and robustness; subsequently, we propose promising new research directions at the intersection of these fields.
Submitted 21 March, 2025; v1 submitted 16 October, 2024;
originally announced October 2024.
-
Accelerating Python Applications with Dask and ProxyStore
Authors:
J. Gregory Pauloski,
Klaudiusz Rydzy,
Valerie Hayot-Sasson,
Ian Foster,
Kyle Chard
Abstract:
Applications are increasingly written as dynamic workflows underpinned by an execution framework that manages asynchronous computations across distributed hardware. However, execution frameworks typically offer one-size-fits-all solutions for data flow management, which can restrict performance and scalability. ProxyStore, a middleware layer that optimizes data flow via an advanced pass-by-reference paradigm, has been shown to be an effective mechanism for addressing these limitations. Here, we investigate integrating ProxyStore with Dask Distributed, one of the most popular libraries for distributed computing in Python, with the goal of supporting scalable and portable scientific workflows. Dask provides an easy-to-use and flexible framework, but is less optimized for scaling certain data-intensive workflows. We investigate these limitations and detail the technical contributions necessary to develop a robust solution for distributed applications and demonstrate improved performance on synthetic benchmarks and real applications.
Submitted 17 October, 2024; v1 submitted 15 October, 2024;
originally announced October 2024.
-
Mitigating Memorization In Language Models
Authors:
Mansi Sakarvadia,
Aswathy Ajith,
Arham Khan,
Nathaniel Hudson,
Caleb Geniesse,
Kyle Chard,
Yaoqing Yang,
Ian Foster,
Michael W. Mahoney
Abstract:
Language models (LMs) can "memorize" information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. In this work, we investigate methods to mitigate memorization: three regularizer-based, three fine-tuning-based, and eleven machine unlearning-based methods, with five of the latter being new methods that we introduce. We also introduce TinyMem, a suite of small, computationally efficient LMs for the rapid development and evaluation of memorization-mitigation methods. We demonstrate that the mitigation methods that we develop using TinyMem can successfully be applied to production-grade LMs, and we determine via experiment that: regularizer-based mitigation methods are slow and ineffective at curbing memorization; fine-tuning-based methods are effective at curbing memorization, but overly expensive, especially for retaining higher accuracies; and unlearning-based methods are faster and more effective, allowing for the precise localization and removal of memorized information from LM weights prior to inference. We show, in particular, that our proposed unlearning method BalancedSubnet outperforms other mitigation methods at removing memorized information while preserving performance on target tasks.
Submitted 28 January, 2025; v1 submitted 2 October, 2024;
originally announced October 2024.
-
Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches
Authors:
Xuefeng Liu,
Songhao Jiang,
Xiaotian Duan,
Archit Vasan,
Chong Liu,
Chih-chan Tien,
Heng Ma,
Thomas Brettin,
Fangfang Xia,
Ian T. Foster,
Rick L. Stevens
Abstract:
Protein-ligand binding is the process by which a small molecule (drug or inhibitor) attaches to a target protein. The binding affinity, which refers to the strength of this interaction, is central to many important problems in bioinformatics such as drug design. An extensive amount of work has been devoted to predicting binding affinity over the past decades due to its significance. In this paper, we review all significant recent works, focusing on the methods, features, and benchmark datasets. We have observed a rising trend in the use of traditional machine learning and deep learning models for predicting binding affinity, accompanied by an increasing amount of data on proteins and small drug-like molecules. While prediction results are constantly improving, we also identify several open questions and potential directions that remain unexplored in the field. This paper could serve as an excellent starting point for machine learning researchers who wish to engage in the study of binding affinity, or for anyone with general interests in machine learning, drug discovery, and bioinformatics.
Submitted 29 September, 2024;
originally announced October 2024.
-
Flight: A FaaS-Based Framework for Complex and Hierarchical Federated Learning
Authors:
Nathaniel Hudson,
Valerie Hayot-Sasson,
Yadu Babuji,
Matt Baughman,
J. Gregory Pauloski,
Ryan Chard,
Ian Foster,
Kyle Chard
Abstract:
Federated Learning (FL) is a decentralized machine learning paradigm where models are trained on distributed devices and are aggregated at a central server. Existing FL frameworks assume simple two-tier network topologies where end devices are directly connected to the aggregation server. While this is a practical mental model, it does not exploit the inherent topology of real-world distributed systems like the Internet-of-Things. We present Flight, a novel FL framework that supports complex hierarchical multi-tier topologies and asynchronous aggregation, and decouples the control plane from the data plane. We compare the performance of Flight against Flower, a state-of-the-art FL framework. Our results show that Flight scales beyond Flower, supporting up to 2048 simultaneous devices, and reduces FL makespan across several models. Finally, we show that Flight's hierarchical FL model can reduce communication overheads by more than 60%.
Submitted 24 September, 2024;
originally announced September 2024.
-
Sustainable Data Democratization: A Multifaceted Investment for an Equitable Future
Authors:
Michela Taufer,
Valerio Pascucci,
Christine R. Kirkpatrick,
Ian T. Foster
Abstract:
The urgent need for data democratization in scientific research was the focal point of a panel discussion at SC23 in Denver, Colorado, from November 12 to 17, 2023. This article summarizes the outcomes of that discussion and subsequent conversations. We advocate for strategic investments in financial, human, and technological resources for sustainable data democratization. Emphasizing that data is central to scientific discovery and AI deployment, we highlight barriers such as limited access, inadequate financial incentives for cross-domain collaboration, and a shortage of workforce development initiatives. Our recommendations aim to guide decision-makers in fostering an inclusive research community, breaking down research silos, and developing a skilled workforce to advance scientific discovery.
Submitted 26 August, 2024;
originally announced August 2024.
-
Employing Artificial Intelligence to Steer Exascale Workflows with Colmena
Authors:
Logan Ward,
J. Gregory Pauloski,
Valerie Hayot-Sasson,
Yadu Babuji,
Alexander Brace,
Ryan Chard,
Kyle Chard,
Rajeev Thakur,
Ian Foster
Abstract:
Computational workflows are a common class of application on supercomputers, yet the loosely coupled and heterogeneous nature of workflows often fails to take full advantage of these systems' capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations coupled with a variety of application patterns accessible through our agent-based steering model have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.
Submitted 26 August, 2024;
originally announced August 2024.
-
TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks
Authors:
J. Gregory Pauloski,
Valerie Hayot-Sasson,
Maxime Gonthier,
Nathaniel Hudson,
Haochen Pan,
Sicheng Zhou,
Ian Foster,
Kyle Chard
Abstract:
Task-based execution frameworks, such as parallel programming libraries, computational workflow systems, and function-as-a-service platforms, enable the composition of distinct tasks into a single, unified application designed to achieve a computational goal. Task-based execution frameworks abstract the parallel execution of an application's tasks on arbitrary hardware. Research into these task executors has accelerated as computational sciences increasingly need to take advantage of parallel compute and/or heterogeneous hardware. However, the lack of evaluation standards makes it challenging to compare and contrast novel systems against existing implementations. Here, we introduce TaPS, the Task Performance Suite, to support continued research in parallel task executor frameworks. TaPS provides (1) a unified, modular interface for writing and evaluating applications using arbitrary execution frameworks and data management systems and (2) an initial set of reference synthetic and real-world science applications. We discuss how the design of TaPS supports the reliable evaluation of frameworks and demonstrate TaPS through a survey of benchmarks using the provided reference applications.
Submitted 13 August, 2024;
originally announced August 2024.
-
Octopus: Experiences with a Hybrid Event-Driven Architecture for Distributed Scientific Computing
Authors:
Haochen Pan,
Ryan Chard,
Sicheng Zhou,
Alok Kamatar,
Rafael Vescovi,
Valérie Hayot-Sasson,
André Bauer,
Maxime Gonthier,
Kyle Chard,
Ian Foster
Abstract:
Scientific research increasingly relies on distributed computational resources, storage systems, networks, and instruments, ranging from HPC and cloud systems to edge devices. Event-driven architecture (EDA) benefits applications targeting distributed research infrastructures by enabling the organization, communication, processing, reliability, and security of events generated from many sources. To support the development of scientific EDA, we introduce Octopus, a hybrid, cloud-to-edge event fabric designed to link many local event producers and consumers with cloud-hosted brokers. Octopus can be scaled to meet demand, permits the deployment of highly available Triggers for automatic event processing, and enforces fine-grained access control. We identify requirements in self-driving laboratories, scientific data automation, online task scheduling, epidemic modeling, and dynamic workflow management use cases, and present results demonstrating Octopus' ability to meet those requirements. Octopus supports producing and consuming events at a rate of over 4.2 M and 9.6 M events per second, respectively, from distributed clients.
Submitted 28 September, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Foundation Models for the Electric Power Grid
Authors:
Hendrik F. Hamann,
Thomas Brunschwiler,
Blazhe Gjorgiev,
Leonardo S. A. Martins,
Alban Puech,
Anna Varbella,
Jonas Weiss,
Juan Bernabe-Moreno,
Alexandre Blondin Massé,
Seong Choi,
Ian Foster,
Bri-Mathias Hodge,
Rishabh Jain,
Kibaek Kim,
Vincent Mai,
François Mirallès,
Martin De Montigny,
Octavio Ramos-Leaños,
Hussein Suprême,
Le Xie,
El-Nasser S. Youssef,
Arnaud Zinflou,
Alexander J. Belyi,
Ricardo J. Bessa,
Bishnu Prasad Bhattarai,
et al. (2 additional authors not shown)
Abstract:
Foundation models (FMs) currently dominate news headlines. They employ advanced deep learning architectures to extract structural information autonomously from vast datasets through self-supervision. The resulting rich representations of complex systems and dynamics can be applied to many downstream applications. Therefore, FMs can find uses in electric power grids, challenged by the energy transition and climate change. In this paper, we call for the development of, and state why we believe in, the potential of FMs for electric grids. We highlight their strengths and weaknesses amidst the challenges of a changing grid. We argue that an FM learning from diverse grid data and topologies could unlock transformative capabilities, pioneering a new approach in leveraging AI to redefine how we manage complexity and uncertainty in the electric grid. Finally, we discuss a power grid FM concept, namely GridFM, based on graph neural networks and show how different downstream tasks benefit.
Submitted 12 November, 2024; v1 submitted 12 July, 2024;
originally announced July 2024.
-
Object Proxy Patterns for Accelerating Distributed Applications
Authors:
J. Gregory Pauloski,
Valerie Hayot-Sasson,
Logan Ward,
Alexander Brace,
André Bauer,
Kyle Chard,
Ian Foster
Abstract:
Workflow and serverless frameworks have empowered new approaches to distributed application design by abstracting compute resources. However, their typically limited or one-size-fits-all support for advanced data flow patterns leaves optimization to the application programmer -- optimization that becomes more difficult as data become larger. The transparent object proxy, which provides wide-area references that can resolve to data regardless of location, has been demonstrated as an effective low-level building block in such situations. Here we propose three high-level proxy-based programming patterns -- distributed futures, streaming, and ownership -- that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. We evaluate our implementations through a suite of benchmarks and by applying them in three substantial scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.
Submitted 2 December, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
GreenFaaS: Maximizing Energy Efficiency of HPC Workloads with FaaS
Authors:
Alok Kamatar,
Valerie Hayot-Sasson,
Yadu Babuji,
Andre Bauer,
Gourav Rattihalli,
Ninad Hogade,
Dejan Milojicic,
Kyle Chard,
Ian Foster
Abstract:
Application energy efficiency can be improved by executing each application component on the compute element that consumes the least energy while also satisfying time constraints. In principle, the function as a service (FaaS) paradigm should simplify such optimizations by abstracting away compute location, but existing FaaS systems do not provide for user transparency over application energy consumption or task placement. Here we present GreenFaaS, a novel open source framework that bridges this gap between energy-efficient applications and FaaS platforms. GreenFaaS can be deployed by end users or providers across systems to monitor energy use, provide task-specific feedback, and schedule tasks in an energy-aware manner. We demonstrate that intelligent placement of tasks can both reduce energy consumption and improve performance. For a synthetic workload, GreenFaaS reduces the energy-delay product by 45% compared to alternatives. Furthermore, running a molecular design application through GreenFaaS can reduce energy consumption by 21% and runtime by 63% by better matching tasks with machines.
Submitted 25 June, 2024;
originally announced June 2024.
-
Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning
Authors:
Ashka Shah,
Adela DePavia,
Nathaniel Hudson,
Ian Foster,
Rick Stevens
Abstract:
The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause and effect relationships in a generalized way -- without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure -- a set of learned or existing candidate hypotheses -- to partition the search space. We prove under certain assumptions that learning with a causal graph partition always yields the Markov Equivalence Class of the true causal graph. We show our algorithm achieves comparable accuracy and a faster time to solution for biologically-tuned synthetic networks and networks of up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.
Submitted 3 March, 2025; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers
Authors:
Thomas Bouvier,
Bogdan Nicolae,
Hugo Chaugier,
Alexandru Costan,
Ian Foster,
Gabriel Antoniu
Abstract:
Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training suffers from catastrophic forgetting (i.e., new patterns are reinforced at the expense of previously acquired knowledge). Training from scratch each time new training data becomes available would result in extremely long training times and massive data accumulation. Rehearsal-based continual learning has shown promise for addressing the catastrophic forgetting challenge, but research to date has not addressed performance and scalability. To fill this gap, we propose an approach based on a distributed rehearsal buffer that efficiently complements data-parallel training on multiple GPUs, allowing us to achieve short runtime and scalability while retaining high accuracy. It leverages a set of buffers (local to each GPU) and uses several asynchronous techniques for updating these local buffers in an embarrassingly parallel fashion, all while handling the communication overheads necessary to augment input mini-batches (groups of training samples fed to the model) using unbiased, global sampling. In this paper we explore the benefits of this approach for classification models. We run extensive experiments on up to 128 GPUs of the ThetaGPU supercomputer to compare our approach with baselines representative of training-from-scratch (the upper bound in terms of accuracy) and incremental training (the lower bound). Results show that rehearsal-based continual learning achieves a top-5 classification accuracy close to the upper bound, while simultaneously exhibiting a runtime close to the lower bound.
Submitted 5 June, 2024;
originally announced June 2024.
-
Oil & Water? Diffusion of AI Within and Across Scientific Fields
Authors:
Eamon Duede,
William Dolan,
André Bauer,
Ian Foster,
Karim Lakhani
Abstract:
This study empirically investigates claims of the increasing ubiquity of artificial intelligence (AI) within roughly 80 million research publications across 20 diverse scientific fields, by examining the change in scholarly engagement with AI from 1985 through 2022. We observe exponential growth, with AI-engaged publications increasing approximately thirteenfold (13x) across all fields, suggesting a dramatic shift from niche to mainstream. Moreover, we provide the first empirical examination of the distribution of AI-engaged publications across publication venues within individual fields, with results that reveal a broadening of AI engagement within disciplines. While this broadening engagement suggests a move toward greater disciplinary integration in every field, increased ubiquity is associated with a semantic tension between AI-engaged research and more traditional disciplinary research. Through an analysis of tens of millions of document embeddings, we observe a complex interplay between AI-engaged and non-AI-engaged research within and across fields, suggesting that increasing ubiquity is something of an oil-and-water phenomenon -- AI-engaged work is spreading out over fields, but not mixing well with non-AI-engaged work.
Submitted 23 May, 2024;
originally announced May 2024.
-
SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation
Authors:
Yuwei Wan,
Yixuan Liu,
Aswathy Ajith,
Clara Grazian,
Bram Hoex,
Wenjie Zhang,
Chunyu Kit,
Tong Xie,
Ian Foster
Abstract:
We introduce SciQAG, a novel framework for automatically generating high-quality science question-answer pairs from a large corpus of scientific literature based on large language models (LLMs). SciQAG consists of a QA generator and a QA evaluator, which work together to extract diverse and research-level questions and answers from scientific papers. Utilizing this framework, we construct a large-scale, high-quality, open-ended science QA dataset containing 188,042 QA pairs extracted from 22,743 scientific papers across 24 scientific domains. We also introduce SciQAG-24D, a new benchmark task designed to evaluate the science question-answering ability of LLMs. Extensive experiments demonstrate that fine-tuning LLMs on the SciQAG dataset significantly improves their performance on both open-ended question answering and scientific tasks. To foster research and collaboration, we make the datasets, models, and evaluation codes publicly available, contributing to the advancement of science question answering and developing more interpretable and reasoning-capable AI systems.
Submitted 9 July, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Automated, Reliable, and Efficient Continental-Scale Replication of 7.3 Petabytes of Climate Simulation Data: A Case Study
Authors:
Lukasz Lacinski,
Lee Liming,
Steven Turoscy,
Cameron Harr,
Kyle Chard,
Eli Dart,
Paul Durack,
Sasha Ames,
Forrest M. Hoffman,
Ian T. Foster
Abstract:
We report on our experiences replicating 7.3 petabytes (PB) of Earth System Grid Federation (ESGF) climate simulation data from Lawrence Livermore National Laboratory (LLNL) in California to Argonne National Laboratory (ANL) in Illinois and Oak Ridge National Laboratory (ORNL) in Tennessee. This movement of some 29 million files, twice, undertaken in order to establish new ESGF nodes at ANL and ORNL, was performed largely automatically by a simple replication tool, a script that invoked Globus to transfer large bundles of files while tracking progress in a database. Under the covers, Globus organized transfers to make efficient use of the high-speed Energy Sciences Network (ESnet) and the data transfer nodes deployed at participating sites, and also addressed security, integrity checking, and recovery from a variety of transient failures. This success demonstrates the considerable benefits that can accrue from the adoption of performant data replication infrastructure.
Submitted 30 April, 2024;
originally announced April 2024.
-
MalleTrain: Deep Neural Network Training on Unfillable Supercomputer Nodes
Authors:
Xiaolong Ma,
Feng Yan,
Lei Yang,
Ian Foster,
Michael E. Papka,
Zhengchun Liu,
Rajkumar Kettimuthu
Abstract:
First-come first-serve scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well-suited for deep neural network (DNN) training, due to the flexible nature of DNN training tasks, Liu et al. proposed that the re-scaling of DNN training tasks to fit gaps in schedules be formulated as a mixed-integer linear programming (MILP) problem, and demonstrated via simulation the potential benefits of the approach. Here, we introduce MalleTrain, a system that provides the first practical implementation of this approach and that furthermore generalizes it by allowing its use even for DNN training applications for which model information is unknown before runtime. Key to this latter innovation is the use of a lightweight online job profiling advisor (JPA) to collect critical scalability information for DNN jobs -- information that it then employs to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present the results of a detailed experimental evaluation on a supercomputer GPU cluster and several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but improve significantly on prior results, improving training throughput by up to 22.3% without requiring users to provide job scalability information.
Submitted 24 April, 2024;
originally announced April 2024.
-
Twins in rotational spectroscopy: Does a rotational spectrum uniquely identify a molecule?
Authors:
Marcus Schwarting,
Nathan A. Seifert,
Michael J. Davis,
Ben Blaiszik,
Ian Foster,
Kirill Prozument
Abstract:
Rotational spectroscopy is the most accurate method for determining structures of molecules in the gas phase. It is often assumed that a rotational spectrum is a unique "fingerprint" of a molecule. The availability of large molecular databases and the development of artificial intelligence methods for spectroscopy make the testing of this assumption timely. In this paper, we pose the determination of molecular structures from rotational spectra as an inverse problem. Within this framework, we adopt a funnel-based approach to search for molecular twins: two or more molecules that have similar rotational spectra but distinctly different molecular structures. We demonstrate that there are twins within standard levels of computational accuracy by generating rotational constants for many molecules from several large molecular databases, indicating the inverse problem is ill-posed. However, some twins can be distinguished by increasing the accuracy of the theoretical methods or by performing additional experiments.
Submitted 5 April, 2024;
originally announced April 2024.
-
FastqZip: An Improved Reference-Based Genome Sequence Lossy Compression Framework
Authors:
Yuanjian Liu,
Huihao Luo,
Zhijun Han,
Yao Hu,
Yehui Yang,
Kyle Chard,
Sheng Di,
Ian Foster,
Jiesheng Wu
Abstract:
Storing and archiving data produced by next-generation sequencing (NGS) is a huge burden for research institutions. Reference-based compression algorithms are effective in dealing with these data. Our work focuses on compressing FASTQ format files with an improved reference-based compression algorithm to achieve a higher compression ratio than other state-of-the-art algorithms. We propose FastqZip, which uses a new method for mapping sequences to the reference for compression, allows read reordering and lossy quality scores, and applies the BSC or ZPAQ algorithm for final lossless compression, yielding a higher compression ratio at relatively fast speed. Our method ensures the sequence can be losslessly reconstructed while allowing lossless or lossy compression for the quality scores. We reorder the reads to get a higher compression ratio. We evaluate our algorithms on five datasets and show that FastqZip can outperform the state-of-the-art algorithm Genozip by around 10% in terms of compression ratio while having an acceptable slowdown.
Submitted 22 February, 2024;
originally announced April 2024.
-
UniFaaS: Programming across Distributed Cyberinfrastructure with Federated Function Serving
Authors:
Yifei Li,
Ryan Chard,
Yadu Babuji,
Kyle Chard,
Ian Foster,
Zhuozhao Li
Abstract:
Modern scientific applications are increasingly decomposable into individual functions that may be deployed across distributed and diverse cyberinfrastructure such as supercomputers, clouds, and accelerators. Such applications call for new approaches to programming, distributed execution, and function-level management. We present UniFaaS, a parallel programming framework that relies on a federated function-as-a-service (FaaS) model to enable composition of distributed, scalable, and high-performance scientific workflows, and to support fine-grained function-level management. UniFaaS provides a unified programming interface to compose dynamic task graphs with transparent wide-area data management. UniFaaS exploits an observe-predict-decide approach to efficiently map workflow tasks to target heterogeneous and dynamic resources. We propose a dynamic heterogeneity-aware scheduling algorithm that employs a delay mechanism and a re-scheduling mechanism to accommodate dynamic resource capacity. Our experiments show that UniFaaS can efficiently execute workflows across computing resources with minimal scheduling overhead. We show that, compared with using a single cluster, UniFaaS can improve the performance of a real-world drug screening workflow by as much as 22.99% when employing an additional 19.48% of resources, and of a montage workflow by 54.41% when employing an additional 47.83% of resources, across multiple distributed clusters.
Submitted 28 March, 2024;
originally announced March 2024.
-
Steering a Fleet: Adaptation for Large-Scale, Workflow-Based Experiments
Authors:
Jim Pruyne,
Valerie Hayot-Sasson,
Weijian Zheng,
Ryan Chard,
Justin M. Wozniak,
Tekin Bicer,
Kyle Chard,
Ian T. Foster
Abstract:
Experimental science is increasingly driven by instruments that produce vast volumes of data and thus a need to manage, compute, describe, and index this data. High performance and distributed computing provide the means of addressing the computing needs; however, in practice, the variety of actions required and the distributed set of resources involved require sophisticated "flows" defining the steps to be performed on data. As each scan or measurement is performed by an instrument, a new instance of the flow is initiated, resulting in a "fleet" of concurrently running flows, with the overall goal of processing all the data collected during a potentially long-running experiment. During the course of the experiment, each flow may need to adapt its execution due to changes in the environment, such as computational or storage resource availability, or based on the progress of the fleet as a whole, such as completion or discovery of an intermediate result leading to a change in subsequent flows' behavior. We introduce a cloud-based decision engine, Braid, which flows consult during execution to query their run-time environment and coordinate with other flows within their fleet. Braid accepts streams of measurements taken from the run-time environment or from within flow runs, which can then be statistically aggregated and compared to other streams to determine a strategy to guide flow execution. For example, queue lengths in execution environments can be used to direct a flow to run computations in one environment or another, or experiment progress as measured by individual flows can be aggregated to determine the progress and subsequent direction of the flows within a fleet. We describe Braid, its interface, implementation, and performance characteristics. We further show, through examples and experience modifying an existing scientific flow, how Braid is used to make adaptable flows.
Submitted 9 March, 2024;
originally announced March 2024.
-
Combining Language and Graph Models for Semi-structured Information Extraction on the Web
Authors:
Zhi Hong,
Kyle Chard,
Ian Foster
Abstract:
Relation extraction is an efficient way of mining the extraordinary wealth of human knowledge on the Web. Existing methods rely on domain-specific training data or produce noisy outputs. We focus here on extracting targeted relations from semi-structured web pages given only a short description of the relation. We present GraphScholarBERT, an open-domain information extraction method based on a joint graph and language model structure. GraphScholarBERT can generalize to previously unseen domains without additional data or training and produces only clean extraction results matched to the search keyword. Experiments show that GraphScholarBERT can improve extraction F1 scores by as much as 34.8% compared to previous work in a zero-shot domain and zero-shot website setting.
Submitted 21 February, 2024;
originally announced February 2024.
-
Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision
Authors:
Nathaniel Hudson,
J. Gregory Pauloski,
Matt Baughman,
Alok Kamatar,
Mansi Sakarvadia,
Logan Ward,
Ryan Chard,
André Bauer,
Maksim Levental,
Wenyi Wang,
Will Engler,
Owen Price Skelly,
Ben Blaiszik,
Rick Stevens,
Kyle Chard,
Ian Foster
Abstract:
Deep learning methods are transforming research, enabling new techniques, and ultimately leading to new discoveries. As the demand for more capable AI models continues to grow, we are now entering an era of Trillion Parameter Models (TPM), or models with more than a trillion parameters -- such as Huawei's PanGu-Σ. We describe a vision for the ecosystem of TPM users and providers that caters to the specific needs of the scientific community. We then outline the significant technical challenges and open problems in system design for serving TPMs to enable scientific research and discovery. Specifically, we describe the requirements of a comprehensive software stack and interfaces to support the diverse and flexible requirements of researchers.
Submitted 5 February, 2024;
originally announced February 2024.
-
XaaS: Acceleration as a Service to Enable Productive High-Performance Cloud Computing
Authors:
Torsten Hoefler,
Marcin Copik,
Pete Beckman,
Andrew Jones,
Ian Foster,
Manish Parashar,
Daniel Reed,
Matthias Troyer,
Thomas Schulthess,
Dan Ernst,
Jack Dongarra
Abstract:
HPC and Cloud have evolved independently, specializing their innovations into performance or productivity. Acceleration as a Service (XaaS) is a recipe to empower both fields with a shared execution platform that provides transparent access to computing resources, regardless of the underlying cloud or HPC service provider. Bridging HPC and cloud advancements, XaaS presents a unified architecture built on performance-portable containers. Our converged model concentrates on low-overhead, high-performance communication and computing, targeting resource-intensive workloads from climate simulations to machine learning. XaaS lifts the restricted allocation model of Function-as-a-Service (FaaS), allowing users to benefit from the flexibility and efficient resource utilization of serverless while supporting long-running and performance-sensitive workloads from HPC.
Submitted 9 January, 2024;
originally announced January 2024.
-
Comprehensive Exploration of Synthetic Data Generation: A Survey
Authors:
André Bauer,
Simon Trapp,
Michael Stenger,
Robert Leppich,
Samuel Kounev,
Mark Leznik,
Kyle Chard,
Ian Foster
Abstract:
Recent years have witnessed a surge in the popularity of Machine Learning (ML), applied across diverse domains. However, progress is impeded by the scarcity of training data due to expensive acquisition and privacy legislation. Synthetic data emerges as a solution, but the abundance of released models and limited overview literature pose challenges for decision-making. This work surveys 417 Synthetic Data Generation (SDG) models over the last decade, providing a comprehensive overview of model types, functionality, and improvements. Common attributes are identified, leading to a classification and trend analysis. The findings reveal increased model performance and complexity, with neural network-based approaches prevailing, except for privacy-preserving data generation. Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete. Implications from our performance evaluation highlight the scarcity of common metrics and datasets, making comparisons challenging. Additionally, the neglect of training and computational costs in literature necessitates attention in future research. This work serves as a guide for SDG model selection and identifies crucial areas for future exploration.
Submitted 1 February, 2024; v1 submitted 4 January, 2024;
originally announced January 2024.
-
WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data
Authors:
Maurice Weber,
Carlo Siebenschuh,
Rory Butler,
Anton Alexandrov,
Valdemar Thanner,
Georgios Tsolakis,
Haris Jabbar,
Ian Foster,
Bo Li,
Rick Stevens,
Ce Zhang
Abstract:
We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora comprising millions of pages with annotations for document layout detection. Relating visual and textual items on document pages has gained further significance with the advent of multimodal models. Various approaches have proved effective for visual question answering or layout segmentation. However, the interplay of text, tables, and visuals remains challenging for a variety of document understanding tasks. In particular, many models fail to generalize well to diverse domains and new languages due to insufficient availability of training data. WordScape addresses these limitations. Our automatic annotation pipeline parses the Open XML structure of Word documents obtained from the web, jointly providing layout-annotated document images and their textual representations. In turn, WordScape offers unique properties as it (1) leverages the ubiquity of the Word file format on the internet, (2) is readily accessible through the Common Crawl web corpus, (3) is adaptive to domain-specific documents, and (4) offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text. Together with the pipeline, we will additionally release 9.5M URLs to Word documents, which can be processed using WordScape to create a dataset of over 40M pages. Finally, we investigate the quality of text and layout annotations extracted by WordScape, assess the impact on document understanding benchmarks, and demonstrate that manual labeling costs can be substantially reduced.
Submitted 15 December, 2023;
originally announced December 2023.
-
Rapid detection of rare events from in situ X-ray diffraction data using machine learning
Authors:
Weijian Zheng,
Jun-Sang Park,
Peter Kenesei,
Ahsan Ali,
Zhengchun Liu,
Ian T. Foster,
Nicholas Schwarz,
Rajkumar Kettimuthu,
Antonino Miceli,
Hemant Sharma
Abstract:
High-energy X-ray diffraction methods can non-destructively map the 3D microstructure and associated attributes of metallic polycrystalline engineering materials in their bulk form. These methods are often combined with external stimuli such as thermo-mechanical loading to take snapshots over time of the evolving microstructure and attributes. However, the extreme data volumes and the high costs of traditional data acquisition and reduction approaches pose a barrier to quickly extracting actionable insights and improving the temporal resolution of these snapshots. Here we present a fully automated technique capable of rapidly detecting the onset of plasticity in high-energy X-ray microscopy data. Our technique is at least 50 times faster computationally than the traditional approaches and works for data sets that are up to 9 times sparser than a full data set. This new technique leverages self-supervised image representation learning and clustering to transform massive data into compact, semantic-rich representations of visually salient characteristics (e.g., peak shapes). These characteristics can be a rapid indicator of anomalous events such as changes in diffraction peak shapes. We anticipate that this technique will provide just-in-time actionable information to drive smarter experiments that effectively deploy multi-modal X-ray diffraction methods that span many decades of length scales.
Submitted 6 December, 2023;
originally announced December 2023.
-
Scaling transformer neural networks for skillful and reliable medium-range weather forecasting
Authors:
Tung Nguyen,
Rohan Shah,
Hritik Bansal,
Troy Arcomano,
Romit Maulik,
Veerabhadra Kotamarthi,
Ian Foster,
Sandeep Madireddy,
Aditya Grover
Abstract:
Weather forecasting is a fundamental problem for anticipating and mitigating the impacts of climate change. Recently, data-driven approaches for weather forecasting based on deep learning have shown great promise, achieving accuracies that are competitive with operational systems. However, those methods often employ complex, customized architectures without sufficient ablation analysis, making it difficult to understand what truly contributes to their success. Here we introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone. We identify the key components of Stormer through careful empirical analyses, including weather-specific embedding, randomized dynamics forecast, and pressure-weighted loss. At the core of Stormer is a randomized forecasting objective that trains the model to forecast the weather dynamics over varying time intervals. During inference, this allows us to produce multiple forecasts for a target lead time and combine them to obtain better forecast accuracy. On WeatherBench 2, Stormer performs competitively at short to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders-of-magnitude less training data and compute. Additionally, we demonstrate Stormer's favorable scaling properties, showing consistent improvements in forecast accuracy with increases in model size and training tokens. Code and checkpoints are available at https://github.com/tung-nd/stormer.
Submitted 22 October, 2024; v1 submitted 6 December, 2023;
originally announced December 2023.
-
Accelerating Electronic Stopping Power Predictions by 10 Million Times with a Combination of Time-Dependent Density Functional Theory and Machine Learning
Authors:
Logan Ward,
Ben Blaiszik,
Cheng-Wei Lee,
Troy Martin,
Ian Foster,
André Schleife
Abstract:
Knowing the rate at which particle radiation releases energy in a material, the stopping power, is key to designing nuclear reactors, medical treatments, semiconductor and quantum materials, and many other technologies. While the nuclear contribution to stopping power, i.e., elastic scattering between atoms, is well understood in the literature, the route for gathering data on the electronic contribution has for decades remained costly and reliant on many simplifying assumptions, including that materials are isotropic. We establish a method that combines time-dependent density functional theory (TDDFT) and machine learning to reduce the time to assess new materials to mere hours on a supercomputer and that provides valuable data on how atomic details influence electronic stopping. Our approach uses TDDFT to compute the electronic contributions to stopping power from first principles in several directions and then machine learning to interpolate to other directions at a cost of 10 million times fewer core-hours. We demonstrate the combined approach in a study of proton irradiation in aluminum and employ it to predict how the depth of maximum energy deposition, the "Bragg Peak," varies depending on incident angle -- a quantity otherwise inaccessible to modelers. The lack of any experimental information requirement makes our method applicable to most materials, and its speed makes it a prime candidate for enabling quantum-to-continuum models of radiation damage. The prospect of reusing valuable TDDFT data for training the model makes our approach appealing for applications in the age of materials data science.
Submitted 25 June, 2024; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism
Authors:
Mansi Sakarvadia,
Arham Khan,
Aswathy Ajith,
Daniel Grzenda,
Nathaniel Hudson,
André Bauer,
Kyle Chard,
Ian Foster
Abstract:
Transformer-based Large Language Models (LLMs) are the state of the art for natural language tasks. Recent work has attempted to decode, by reverse engineering the role of linear layers, the internal mechanisms by which LLMs arrive at their final predictions for text completion tasks. Yet little is known about the specific role of attention heads in producing the final token prediction. We propose Attention Lens, a tool that enables researchers to translate the outputs of attention heads into vocabulary tokens via learned attention-head-specific transformations called lenses. Preliminary findings from our trained lenses indicate that attention heads play highly specialized roles in language models. The code for Attention Lens is available at github.com/msakarvadia/AttentionLens.
Submitted 24 October, 2023;
originally announced October 2023.
-
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
Authors:
Shuaiwen Leon Song,
Bonnie Kruft,
Minjia Zhang,
Conglong Li,
Shiyang Chen,
Chengming Zhang,
Masahiro Tanaka,
Xiaoxia Wu,
Jeff Rasley,
Ammar Ahmad Awan,
Connor Holmes,
Martin Cai,
Adam Ghanem,
Zhongzhu Zhou,
Yuxiong He,
Pete Luferenko,
Divya Kumar,
Jonathan Weyn,
Ruixiong Zhang,
Sylwester Klocek,
Volodymyr Vragov,
Mohammed AlQuraishi,
Gustaf Ahdritz,
Christina Floristean,
Cristina Negri
, et al. (67 additional authors not shown)
Abstract:
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative (deepspeed4science.ai), which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference, and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research.
Submitted 11 October, 2023; v1 submitted 6 October, 2023;
originally announced October 2023.