-
To Stream or Not to Stream: Towards A Quantitative Model for Remote HPC Processing Decisions
Authors:
Flavio Castro,
Weijian Zheng,
Joaquin Chung,
Ian Foster,
Rajkumar Kettimuthu
Abstract:
Modern scientific instruments generate data at rates that increasingly exceed local compute capabilities and, when paired with the staging and I/O overheads of file-based transfers, also render file-based use of remote HPC resources impractical for time-sensitive analysis and experimental steering. Real-time streaming frameworks promise to reduce latency and improve system efficiency, but lack a principled way to assess their feasibility. In this work, we introduce a quantitative framework and an accompanying Streaming Speed Score to evaluate whether remote high-performance computing (HPC) resources can provide timely data processing compared to local alternatives. Our model incorporates key parameters including data generation rate, transfer efficiency, remote processing power, and file input/output overhead to compute total processing completion time and identify operational regimes where streaming is beneficial. We motivate our methodology with use cases from facilities such as APS, FRIB, LCLS-II, and the LHC, and validate our approach through an illustrative case study based on LCLS-II data. Our measurements show that streaming can achieve up to 97% lower end-to-end completion time than file-based methods under high data rates, while worst-case congestion can increase transfer times by over an order of magnitude, underscoring the importance of tail latency in streaming feasibility decisions.
Submitted 29 September, 2025; v1 submitted 23 September, 2025;
originally announced September 2025.
-
InterQnet: A Heterogeneous Full-Stack Approach to Co-designing Scalable Quantum Networks
Authors:
Joaquin Chung,
Daniel Dilley,
Ely Eastman,
Alvin Gonzales,
Kara Hokenstad,
Md Shariful Islam,
Varun Jorapur,
Joseph Petrullo,
Andy C. Y. Li,
Bikun Li,
Vasileios Niaouris,
Anirudh Ramesh,
Ansh Singal,
Caitao Zhan,
Michael Bishof,
Eric Chitambar,
Jacob P. Covey,
Alan Dibos,
Xu Han,
Liang Jiang,
Prem Kumar,
Jeffrey Larson,
Zain H. Saleem,
Rajkumar Kettimuthu
Abstract:
Quantum communications have progressed significantly, moving from a theoretical concept to small-scale experiments to recent metropolitan-scale demonstrations. As the technology matures, it is expected to revolutionize quantum computing in much the same way that classical networks revolutionized classical computing. Quantum communications will also enable breakthroughs in quantum sensing, metrology, and other areas. However, scalability has emerged as a major challenge, particularly in terms of the number and heterogeneity of nodes, the distances between nodes, the diversity of applications, and the scale of user demand. This paper describes InterQnet, a multidisciplinary project that advances scalable quantum communications through a comprehensive approach that improves devices, error handling, and network architecture. InterQnet has a two-pronged strategy to address scalability challenges: InterQnet-Achieve focuses on practical realizations of heterogeneous quantum networks by building and then integrating first-generation quantum repeaters with error mitigation schemes and centralized automated network control systems. The resulting system will enable quantum communications between two heterogeneous quantum platforms through a third type of platform operating as a repeater node. InterQnet-Scale focuses on a systems study of architectural choices for scalable quantum networks by developing forward-looking models of quantum network devices, advanced error correction schemes, and entanglement protocols. Here we report our current progress toward achieving our scalability goals.
Submitted 23 September, 2025;
originally announced September 2025.
-
Federated Learning over 5G, WiFi, and Ethernet: Measurements and Evaluation
Authors:
Robert J. Hayek,
Joaquin Chung,
Kayla Comer,
Chandra R. Murthy,
Rajkumar Kettimuthu,
Igor Kadota
Abstract:
Federated Learning (FL) deployments using IoT devices are poised to benefit significantly from advances in NextG wireless. In this paper, we deploy an FL application using a 5G-NR Standalone (SA) testbed with open-source and Commercial Off-the-Shelf (COTS) components. The 5G testbed architecture consists of a network of resource-constrained edge devices, namely Raspberry Pis, and a central server equipped with a Software Defined Radio (SDR) and running O-RAN software. Our testbed also allows edge devices to communicate with the server using WiFi and Ethernet instead of 5G. FL is deployed using the Flower FL framework, for which we developed a comprehensive instrumentation tool to collect and analyze diverse communications and machine learning performance metrics, including model aggregation time, downlink transmission time, training time, and uplink transmission time. Leveraging these measurements, we perform a comparative analysis of the FL application across three network interfaces: 5G, WiFi, and Ethernet. Our experimental results suggest that, on 5G, the uplink model transfer time is a significant factor in the convergence time of FL. In particular, we find that the 5G uplink accounts for roughly 23% of the duration of one average communication round when using all edge devices in our testbed. We also find that the uplink time on 5G is 33.3x higher than on Ethernet and 17.8x higher than on WiFi. Our results also suggest that 5G exacerbates the well-known straggler effect. For reproducibility, we have open-sourced our FL application, instrumentation tools, and testbed configuration.
Submitted 6 April, 2025;
originally announced April 2025.
-
Globus Service Enhancements for Exascale Applications and Facilities
Authors:
Weijian Zheng,
Jack Kordas,
Tyler J. Skluzacek,
Raj Kettimuthu,
Ian Foster
Abstract:
Many extreme-scale applications require the movement of large quantities of data to, from, and among leadership computing facilities, as well as other scientific facilities and the home institutions of facility users. These applications, particularly when leadership computing facilities are involved, can touch upon edge cases (e.g., terabyte files) that had not been a focus of previous Globus optimization work, which had emphasized rather the movement of many smaller (megabyte to gigabyte) files. We report here on how automated client-driven chunking can be used to accelerate both the movement of large files and the integrity checking operations that have proven to be essential for large data transfers. We present detailed performance studies that provide insights into the benefits of these modifications in a range of file transfer scenarios.
Submitted 29 March, 2025;
originally announced March 2025.
-
Design and Simulation of the Adaptive Continuous Entanglement Generation Protocol
Authors:
Caitao Zhan,
Joaquin Chung,
Allen Zang,
Alexander Kolar,
Rajkumar Kettimuthu
Abstract:
Generating and distributing remote entangled pairs (EPs) is a primary function of quantum networks, as entanglement is the fundamental resource for key quantum network applications. A critical performance metric for quantum networks is the time-to-serve (TTS) for users' EP requests, which is the time to distribute EPs between the requested nodes. Minimizing the TTS is essential given the limited qubit coherence time. In this paper, we study the Adaptive Continuous entanglement generation Protocol (ACP), which enables quantum network nodes to continuously generate EPs with their neighbors, while adaptively selecting the neighbors to optimize TTS. Meanwhile, entanglement purification is used to mitigate decoherence in pre-generated EPs prior to the arrival of user requests. We extend the SeQUeNCe simulator to fully implement ACP and conduct extensive simulations across various network scales. Our results show that ACP reduces TTS by up to 94% and increases entanglement fidelity by up to 0.05.
Submitted 16 February, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Effective Defect Detection Using Instance Segmentation for NDI
Authors:
Ashiqur Rahman,
Venkata Devesh Reddy Seethi,
Austin Yunker,
Zachary Kral,
Rajkumar Kettimuthu,
Hamed Alhoori
Abstract:
Ultrasonic testing is a common Non-Destructive Inspection (NDI) method used in aerospace manufacturing. However, the complexity and size of the ultrasonic scans make it challenging to identify defects through visual inspection or machine learning models. Using computer vision techniques to identify defects from ultrasonic scans is an evolving research area. In this study, we used instance segmentation to identify the presence of defects in the ultrasonic scan images of composite panels that are representative of real components manufactured in aerospace. We used two models based on Mask-RCNN (Detectron 2) and YOLO 11 respectively. Additionally, we implemented a simple statistical pre-processing technique that reduces the burden of requiring custom-tailored pre-processing techniques. Our study demonstrates the feasibility and effectiveness of using instance segmentation in the NDI pipeline by significantly reducing data pre-processing time, inspection time, and overall costs.
Submitted 23 January, 2025;
originally announced January 2025.
-
Simulation of Entanglement-Enabled Connectivity in QLANs using SeQUeNCe
Authors:
Francesco Mazza,
Caitao Zhan,
Joaquin Chung,
Rajkumar Kettimuthu,
Marcello Caleffi,
Angela Sara Cacciapuoti
Abstract:
Quantum Local Area Networks (QLANs) represent a promising building block for larger scale quantum networks with the ambitious goal -- in a long time horizon -- of realizing a Quantum Internet. Surprisingly, the physical topology of a QLAN can be enriched by a set of artificial links, enabled by shared multipartite entangled states among the nodes of the network. This novel concept of artificial topology revolutionizes the possibilities of connectivity within the local network, enabling an on-demand manipulation of the artificial network topology. In this paper, we discuss the implementation of the QLAN model in SeQUeNCe, a discrete-event simulator of quantum networks. Specifically, we provide an analysis of how network nodes interact, with an emphasis on the interplay between quantum operations and classical signaling within the network. Remarkably, through the modeling of a measurement protocol and a correction protocol, our QLAN model implementation enables the simulation of the manipulation process of a shared entangled quantum state, and the subsequent engineering of the entanglement-based connectivity. Our simulations demonstrate how to obtain different virtual topologies with different manipulations of the shared resources and with all the possible measurement outcomes, with an arbitrary number of nodes within the network.
Submitted 17 November, 2024;
originally announced November 2024.
-
Analytical Performance Estimations for Quantum Repeater Network Scenarios
Authors:
Allen Zang,
Joaquin Chung,
Rajkumar Kettimuthu,
Martin Suchara,
Tian Zhong
Abstract:
Quantum repeater chains will form the backbone of future quantum networks that distribute entanglement between network nodes. Therefore, it is important to understand the entanglement distribution performance of quantum repeater chains, especially their throughput and latency. By using Markov chains to model the stochastic dynamics in quantum repeater chains, we offer analytical estimations for long-run throughput and on-demand latency of continuous entanglement distribution. We first study single-link entanglement generation using general multiheralded protocols. We then model entanglement distribution with entanglement swapping over two links, using either a single- or a double-heralded entanglement generation protocol. We also demonstrate how the two-link results offer insights into the performance of general $2^k$-link nested repeater chains. Our results enrich the quantitative understanding of quantum repeater network performance, especially the dependence on system parameters. The analytical formulae themselves are valuable reference resources for the quantum networking community. They can serve as benchmarks for quantum network simulation validation or as examples of quantum network dynamics modeling using the Markov chain formalism.
Submitted 16 January, 2025; v1 submitted 16 July, 2024;
originally announced July 2024.
-
MalleTrain: Deep Neural Network Training on Unfillable Supercomputer Nodes
Authors:
Xiaolong Ma,
Feng Yan,
Lei Yang,
Ian Foster,
Michael E. Papka,
Zhengchun Liu,
Rajkumar Kettimuthu
Abstract:
First-come first-serve scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well-suited for deep neural network (DNN) training, due to the flexible nature of DNN training tasks, Liu et al. proposed that the rescaling of DNN training tasks to fit gaps in schedules be formulated as a mixed-integer linear programming (MILP) problem, and demonstrated via simulation the potential benefits of the approach. Here, we introduce MalleTrain, a system that provides the first practical implementation of this approach and that furthermore generalizes it by allowing its use even for DNN training applications for which model information is unknown before runtime. Key to this latter innovation is the use of a lightweight online job profiling advisor (JPA) to collect critical scalability information for DNN jobs -- information that it then employs to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present the results of a detailed experimental evaluation on a supercomputer GPU cluster and several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but improve significantly on prior results, improving training throughput by up to 22.3% without requiring users to provide job scalability information.
Submitted 24 April, 2024;
originally announced April 2024.
-
Toward a Quantum Information System Cybersecurity Taxonomy and Testbed: Exploiting a Unique Opportunity for Early Impact
Authors:
Benjamin Blakely,
Joaquin Chung,
Alec Poczatek,
Ryan Syed,
Raj Kettimuthu
Abstract:
Any human-designed system can potentially be exploited in ways that its designers did not envision, and information systems or networks using quantum components do not escape this reality. We are presented with a unique but quickly waning opportunity to bring cybersecurity concerns to the forefront for quantum information systems before they become widely deployed. The resources and knowledge required to do so, however, may not be common in the cybersecurity community. Yet, a nexus exists. Cybersecurity starts with risk, and there are good taxonomies for security vulnerabilities and impacts in classical systems. In this paper, we propose a preliminary taxonomy for quantum cybersecurity vulnerabilities that accounts for the latest advances in quantum information systems and that must evolve to incorporate well-established cybersecurity principles and methodologies. We envision a testbed environment designed and instrumented with the specific purpose of enabling a broad collaborative community of cybersecurity and quantum information system experts to conduct experimental evaluation of software and hardware security, including both physical and virtual quantum components. Furthermore, we envision that such a resource may be available as a user facility to the open science research community.
Submitted 18 April, 2024;
originally announced April 2024.
-
Rapid detection of rare events from in situ X-ray diffraction data using machine learning
Authors:
Weijian Zheng,
Jun-Sang Park,
Peter Kenesei,
Ahsan Ali,
Zhengchun Liu,
Ian T. Foster,
Nicholas Schwarz,
Rajkumar Kettimuthu,
Antonino Miceli,
Hemant Sharma
Abstract:
High-energy X-ray diffraction methods can non-destructively map the 3D microstructure and associated attributes of metallic polycrystalline engineering materials in their bulk form. These methods are often combined with external stimuli such as thermo-mechanical loading to take snapshots over time of the evolving microstructure and attributes. However, the extreme data volumes and the high costs of traditional data acquisition and reduction approaches pose a barrier to quickly extracting actionable insights and improving the temporal resolution of these snapshots. Here we present a fully automated technique capable of rapidly detecting the onset of plasticity in high-energy X-ray microscopy data. Our technique is computationally faster by at least 50 times than the traditional approaches and works for data sets that are up to 9 times sparser than a full data set. This new technique leverages self-supervised image representation learning and clustering to transform massive data into compact, semantic-rich representations of visually salient characteristics (e.g., peak shapes). These characteristics can be a rapid indicator of anomalous events such as changes in diffraction peak shapes. We anticipate that this technique will provide just-in-time actionable information to drive smarter experiments that effectively deploy multi-modal X-ray diffraction methods that span many decades of length scales.
Submitted 6 December, 2023;
originally announced December 2023.
-
Towards Distributed Quantum Computing by Qubit and Gate Graph Partitioning Techniques
Authors:
Marc Grau Davis,
Joaquin Chung,
Dirk Englund,
Rajkumar Kettimuthu
Abstract:
Distributed quantum computing is motivated by the difficulty in building large-scale, individual quantum computers. To solve that problem, a large quantum circuit is partitioned and distributed to small quantum computers for execution. Partitions running on different quantum computers share quantum information using entangled Bell pairs. However, entanglement generation and purification introduce both runtime and memory overhead in distributed quantum computing. In this paper, we study that trade-off by proposing two techniques for partitioning large quantum circuits and distributing them to small quantum computers. Our techniques map a quantum circuit to a graph representation. We study two approaches: one that considers only gate teleportation, and another that considers both gate and state teleportation to achieve the distributed execution. Then we apply the METIS graph partitioning algorithm to obtain the partitions and the number of entanglement requests between them. We use the SeQUeNCe quantum communication simulator to measure the time required for generating all the entanglements required to execute the distributed circuit. We find that the best partitioning technique will depend on the specific circuit of interest.
Submitted 5 October, 2023;
originally announced October 2023.
-
Masked Sinogram Model with Transformer for ill-Posed Computed Tomography Reconstruction: a Preliminary Study
Authors:
Zhengchun Liu,
Rajkumar Kettimuthu,
Ian Foster
Abstract:
Computed Tomography (CT) is an imaging technique in which information about an object is collected at different angles (called projections or scans). A cross-sectional image showing the internal structure of the slice is then produced by solving an inverse problem. Limited by factors such as radiation dosage and the number of projection angles, the produced images can be noisy or contain artifacts. Inspired by the success of transformers in natural language processing, the core idea of this preliminary study is to treat a projection of tomography as a word token, and the whole scan of the cross-section (a.k.a. sinogram) as a sentence in the context of natural language processing. We then explore the idea of a foundation model by training a masked sinogram model (MSM) and fine-tuning the MSM for various downstream applications, including CT reconstruction under data collection restrictions (e.g., photon budget) and a data-driven approach to approximating solutions of the inverse problem for CT reconstruction. Models and data used in this study are available at https://github.com/lzhengchun/TomoTx.
Submitted 3 September, 2022;
originally announced September 2022.
-
La Résistance: Harnessing Heterogeneous Resources for Adaptive Resiliency in 6G Networks
Authors:
Ganesh C. Sankaran,
Joaquin Chung,
Raj Kettimuthu
Abstract:
Recent years have seen more critical applications designed to protect human lives (e.g., environmental sensing, emergency response, and tactical defense) being deployed over wireless networks. These critical deployments expect higher data rates, ultra-low latency, and ultra-high reliability. 6G wireless networks are expected to fill the gap in terms of the first two aspects (i.e., higher data rates and ultra-low latency); however, providing ultra-high reliability is a wide-open challenge. All is well when everything works, but when there is a failure or a security attack, the entire system collapses, exposing the associated human lives to imminent danger. Avoiding this requires the strongest of assurances that safety and security aspects are protected no matter what. Large-scale critical applications are waiting for this piece of the puzzle to be solved. At this juncture, we envision the bold theme of La Résistance 6G (LR6G) that would pave the way for deploying mission-critical applications and services over 6G networks. It aims to achieve ultra-high reliability and resiliency to any disruptions, be they failures or security attacks. A few disruptions are easy to handle, such as a cloud VM or primary link failure. In both of these cases, applications can be restored by activating the standby resource. However, some disruptions can be detrimental, such as when a cut-vertex fails or when the disruption leaves the critical application without access to a standby resource. These critical applications (e.g., Smart Manufacturing, Smart City, Ocean monitoring, Wildfire monitoring, etc.) are highly distributed in nature. They must continue to deliver their mission objectives during a disruption to protect human lives. In this paper, we present our LR6G vision and outline the challenges towards achieving this vision.
Submitted 29 April, 2022;
originally announced May 2022.
-
fairDMS: Rapid Model Training by Data and Model Reuse
Authors:
Ahsan Ali,
Hemant Sharma,
Rajkumar Kettimuthu,
Peter Kenesei,
Dennis Trujillo,
Antonino Miceli,
Ian Foster,
Ryan Coffee,
Jana Thayer,
Zhengchun Liu
Abstract:
Extracting actionable information rapidly from data produced by instruments such as the Linac Coherent Light Source (LCLS-II) and Advanced Photon Source Upgrade (APS-U) is becoming ever more challenging due to high (up to TB/s) data rates. Conventional physics-based information retrieval methods are hard-pressed to detect interesting events fast enough to enable timely focusing on a rare event or correction of an error. Machine learning (ML) methods that learn cheap surrogate classifiers present a promising alternative, but can fail catastrophically when changes in instrument or sample result in degradation in ML performance. To overcome such difficulties, we present a new data storage and ML model training architecture designed to organize large volumes of data and models so that when model degradation is detected, prior models and/or data can be queried rapidly and a more suitable model retrieved and fine-tuned for new conditions. We show that our approach can achieve up to 100x data labelling speedup compared to the current state-of-the-art, 200x improvement in training speed, and 92x speedup in terms of end-to-end model updating time.
Submitted 11 August, 2022; v1 submitted 20 April, 2022;
originally announced April 2022.
-
High-Performance Ptychographic Reconstruction with Federated Facilities
Authors:
Tekin Bicer,
Xiaodong Yu,
Daniel J. Ching,
Ryan Chard,
Mathew J. Cherukara,
Bogdan Nicolae,
Rajkumar Kettimuthu,
Ian T. Foster
Abstract:
Beamlines at synchrotron light source facilities are powerful scientific instruments used to image samples and observe phenomena at high spatial and temporal resolutions. Typically, these facilities are equipped only with modest compute resources for the analysis of generated experimental datasets. However, high data rate experiments can easily generate data in volumes that take days (or even weeks) to process on those local resources. To address this challenge, we present a system that unifies leadership computing and experimental facilities by enabling the automated establishment of data analysis pipelines that extend from edge data acquisition systems at synchrotron beamlines to remote computing facilities; under the covers, our system uses Globus Auth authentication to minimize user interaction, funcX to run user-defined functions on supercomputers, and Globus Flows to define and execute workflows. We describe the application of this system to ptychography, an ultra-high-resolution coherent diffraction imaging technique that can produce 100s of gigabytes to terabytes in a single experiment. When deployed on the DGX A100 ThetaGPU cluster at the Argonne Leadership Computing Facility and a microscopy beamline at the Advanced Photon Source, our system performs analysis as an experiment progresses to provide timely feedback.
Submitted 22 November, 2021;
originally announced November 2021.
-
BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes
Authors:
Zhengchun Liu,
Rajkumar Kettimuthu,
Michael E. Papka,
Ian Foster
Abstract:
Supercomputer FCFS-based scheduling policies result in many transient idle nodes, a phenomenon that is only partially alleviated by backfill scheduling methods that promote small jobs to run before large jobs. Here we describe how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training. This important workload is easily organized as many small fragments that can be configured dynamically to fit essentially any node*time hole in a supercomputer's schedule. We describe how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed integer linear programming (MILP)-based resource allocation algorithm, and show that this MILP problem can be solved efficiently at run time. We show further how this MILP problem can be adapted to optimize for administrator- or user-defined metrics. We validate our method with supercomputer scheduler logs and different DNN training scenarios, and demonstrate efficiencies of up to 93% compared with running the same training tasks on dedicated nodes. Our method thus enables substantial supercomputer resources to be allocated to DNN training with no impact on other applications.
Submitted 22 June, 2021;
originally announced June 2021.
-
Towards Accommodating Real-time Jobs on HPC Platforms
Authors:
Sam Nickolay,
Eun-Sung Jung,
Rajkumar Kettimuthu,
Ian Foster
Abstract:
Increasing data volumes in scientific experiments necessitate the use of high-performance computing (HPC) resources for data analysis. In many scientific fields, the data generated from scientific instruments and supercomputer simulations must be analyzed rapidly. In fact, the requirement for quasi-instant feedback is growing. Scientists want to use results from one experiment to guide the selection of the next or even to improve the course of a single experiment. Current HPC systems are typically batch-scheduled under policies in which an arriving job is run immediately only if enough resources are available; otherwise, it is queued. It is hard for these systems to support real-time jobs. To meet their requirements, real-time jobs must sometimes preempt batch jobs and/or be scheduled ahead of batch jobs that were submitted earlier. Accommodating real-time jobs may also negatively impact system utilization, especially when preemption/restart of batch jobs is involved. We first explore several existing scheduling strategies to make real-time jobs more likely to be scheduled in due time. We then rigorously formulate the problem as a mixed-integer linear program for offline scheduling and develop novel scheduling heuristics for online scheduling. We perform simulation studies using trace logs of Mira, the IBM BG/Q system at Argonne National Laboratory, to quantify the impact of real-time jobs on batch job performance for various percentages of real-time jobs in the workload. We present new insights gained from grouping jobs into different categories based on runtime and the number of nodes used and studying the performance of each category. Our results show that when real-time jobs make up 10% of the workload, just-in-time checkpointing combined with our heuristic can improve the slowdowns of real-time jobs by 35% while limiting the increase of the slowdowns of batch jobs to 10%.
Submitted 24 March, 2021;
originally announced March 2021.
-
Fast and accurate learned multiresolution dynamical downscaling for precipitation
Authors:
Jiali Wang,
Zhengchun Liu,
Ian Foster,
Won Chang,
Rajkumar Kettimuthu,
Rao Kotamarthi
Abstract:
This study develops a neural network-based approach for emulating high-resolution modeled precipitation data with comparable statistical properties but at greatly reduced computational cost. The key idea is to use a combination of low- and high-resolution simulations to train a neural network to map from the former to the latter. Specifically, we define two types of CNNs, one that stacks variables directly and one that encodes each variable before stacking, and we train each CNN type both with a conventional loss function, such as mean square error (MSE), and with a conditional generative adversarial network (CGAN), for a total of four CNN variants. We compare the four new CNN-derived high-resolution precipitation results with precipitation generated from original high-resolution simulations, a bilinear interpolator, and the state-of-the-art CNN-based super-resolution (SR) technique. Results show that the SR technique produces results similar to those of the bilinear interpolator with smoother spatial and temporal distributions and smaller data variabilities and extremes than the original high-resolution simulations. While the new CNNs trained by MSE generate better results over some regions than the interpolator and SR technique do, their predictions are still not as close as the original high-resolution simulations. The CNNs trained by CGAN generate more realistic and physically reasonable results, better capturing not only data variability in time and space but also extremes such as intense and long-lasting storms. The new proposed CNN-based downscaling approach can downscale precipitation from 50 km to 12 km in 14 min for 30 years once the network is trained (training takes 4 hours using 1 GPU), while the conventional dynamical downscaling would take 1 month using 600 CPU cores to generate simulations at the resolution of 12 km over the contiguous United States.
Submitted 17 January, 2021;
originally announced January 2021.
-
Petascale XCT: 3D Image Reconstruction with Hierarchical Communications on Multi-GPU Nodes
Authors:
Mert Hidayetoglu,
Tekin Bicer,
Simon Garcia de Gonzalo,
Bin Ren,
Vincent De Andrade,
Doga Gursoy,
Raj Kettimuthu,
Ian T. Foster,
Wen-mei W. Hwu
Abstract:
X-ray computed tomography is a commonly used technique for noninvasive imaging at synchrotron facilities. Iterative tomographic reconstruction algorithms are often preferred for recovering high-quality 3D volumetric images from 2D X-ray images; however, their use has been limited to small/medium datasets due to their computational requirements. In this paper, we propose a high-performance iterative reconstruction system for terabyte(s)-scale 3D volumes. Our design involves three novel optimizations: (1) optimization of (back)projection operators by extending the 2D memory-centric approach to 3D; (2) performing hierarchical communications by exploiting "fat-node" architecture with many GPUs; (3) utilization of mixed-precision types while preserving convergence rate and quality. We extensively evaluate the proposed optimizations and scaling on the Summit supercomputer. Our largest reconstruction is a mouse brain volume with 9Kx11Kx11K voxels, where the total reconstruction time is under three minutes using 24,576 GPUs, reaching 65 PFLOPS: 34% of Summit's peak performance.
Submitted 15 September, 2020;
originally announced September 2020.
-
Design and Evaluation of a Simple Data Interface for Efficient Data Transfer Across Diverse Storage
Authors:
Zhengchun Liu,
Rajkumar Kettimuthu,
Joaquin Chung,
Rachana Ananthakrishnan,
Michael Link,
Ian Foster
Abstract:
Modern science and engineering computing environments often feature storage systems of different types, from parallel file systems in high-performance computing centers to object stores operated by cloud providers. To enable easy, reliable, secure, and performant data exchange among these different systems, we propose Connector, a pluggable data access architecture for diverse, distributed storage. By hiding low-level storage system details, this abstraction permits a managed data transfer service (Globus in our case) to interact with a large and easily extended set of storage systems. Equally important, it supports third-party transfers: that is, direct data transfers from source to destination that are initiated by a third-party client but do not engage that third party in the data path. The abstraction also enables management of transfers for performance optimization, error handling, and end-to-end integrity. We present the Connector design, describe implementations for different storage services, evaluate tradeoffs inherent in managed vs. direct transfers, motivate recommended deployment options, and propose a performance model-based method that allows for easy characterization of performance in different contexts without exhaustive benchmarking.
Submitted 7 September, 2020;
originally announced September 2020.
-
BraggNN: Fast X-ray Bragg Peak Analysis Using Deep Learning
Authors:
Zhengchun Liu,
Hemant Sharma,
Jun-Sang Park,
Peter Kenesei,
Antonino Miceli,
Jonathan Almer,
Rajkumar Kettimuthu,
Ian Foster
Abstract:
X-ray diffraction based microscopy techniques such as High Energy Diffraction Microscopy rely on knowledge of the position of diffraction peaks with high precision. These positions are typically computed by fitting the observed intensities in area detector data to a theoretical peak shape such as pseudo-Voigt. As experiments become more complex and detector technologies evolve, the computational cost of such peak detection and shape fitting becomes the biggest hurdle to the rapid analysis required for real-time feedback during in-situ experiments. To this end, we propose BraggNN, a deep learning-based method that can determine peak positions much more rapidly than conventional pseudo-Voigt peak fitting. When applied to a test dataset, BraggNN gives errors of less than 0.29 and 0.57 pixels, relative to the conventional method, for 75% and 95% of the peaks, respectively. When applied to a real experimental dataset, a 3D reconstruction that used peak positions computed by BraggNN yields 15% better results on average as compared to a reconstruction obtained using peak positions determined using conventional 2D pseudo-Voigt fitting. Recent advances in deep learning method implementations and special-purpose model inference accelerators allow BraggNN to deliver enormous performance improvements relative to the conventional method, running, for example, more than 200 times faster than a conventional method on a consumer-class GPU card with out-of-the-box software.
Submitted 2 June, 2021; v1 submitted 18 August, 2020;
originally announced August 2020.
-
Scientific Image Restoration Anywhere
Authors:
Vibhatha Abeykoon,
Zhengchun Liu,
Rajkumar Kettimuthu,
Geoffrey Fox,
Ian Foster
Abstract:
The use of deep learning models within scientific experimental facilities frequently requires low-latency inference, so that, for example, quality control operations can be performed while data are being collected. Edge computing devices can be useful in this context, as their low cost and compact form factor permit them to be co-located with the experimental apparatus. Can such devices, with their limited resources, perform neural network feed-forward computations efficiently and effectively? We explore this question by evaluating the performance and accuracy of a scientific image restoration model, for which both model input and output are images, on edge computing devices. Specifically, we evaluate deployments of TomoGAN, an image-denoising model based on generative adversarial networks developed for low-dose x-ray imaging, on the Google Edge TPU and NVIDIA Jetson. We adapt TomoGAN for edge execution, evaluate model inference performance, and propose methods to address the accuracy drop caused by model quantization. We show that these edge computing devices can deliver accuracy comparable to that of a full-fledged CPU or GPU model, at speeds that are more than adequate for use in the intended deployments, denoising a 1024 x 1024 image in less than a second. Our experiments also show that the Edge TPU models can provide 3x faster inference response than a CPU-based model and 1.5x faster than an edge GPU-based model. This combination of high speed and low cost permits image restoration anywhere.
Submitted 12 November, 2019;
originally announced November 2019.
-
Deep Learning Accelerated Light Source Experiments
Authors:
Zhengchun Liu,
Tekin Bicer,
Rajkumar Kettimuthu,
Ian Foster
Abstract:
Experimental protocols at synchrotron light sources typically process and validate data only after an experiment has completed, which can lead to undetected errors and cannot enable online steering. Real-time data analysis can enable both detection of, and recovery from, errors, and optimization of data acquisition. However, modern scientific instruments, such as detectors at synchrotron light sources, can generate data at GB/s rates. Data processing methods such as the widely used computational tomography usually require considerable computational resources, and yield poor-quality reconstructions in the early stages of data acquisition when available views are sparse. We describe here how a deep convolutional neural network can be integrated into the real-time streaming tomography pipeline to enable better-quality images in the early stages of data acquisition. Compared with conventional streaming tomography processing, our method can significantly improve tomography image quality, deliver comparable images using only 32% of the data needed for conventional streaming processing, and save 68% of the experiment time for data acquisition.
Submitted 9 October, 2019;
originally announced October 2019.
-
TomoGAN: Low-Dose Synchrotron X-Ray Tomography with Generative Adversarial Networks
Authors:
Zhengchun Liu,
Tekin Bicer,
Rajkumar Kettimuthu,
Doga Gursoy,
Francesco De Carlo,
Ian Foster
Abstract:
Synchrotron-based x-ray tomography is a noninvasive imaging technique that allows for reconstructing the internal structure of materials at high spatial resolutions from tens of micrometers to a few nanometers. In order to resolve sample features at smaller length scales, however, a higher radiation dose is required. Therefore, the limitation on the achievable resolution is set primarily by noise at these length scales. We present TomoGAN, a denoising technique based on generative adversarial networks, for improving the quality of reconstructed images for low-dose imaging conditions. We evaluate our approach in two photon-budget-limited experimental conditions: (1) sufficient number of low-dose projections (based on Nyquist sampling), and (2) insufficient or limited number of high-dose projections. In both cases the angular sampling is assumed to be isotropic, and the photon budget throughout the experiment is fixed based on the maximum allowable radiation dose on the sample. Evaluation with both simulated and experimental datasets shows that our approach can significantly reduce noise in reconstructed images, improving the structural similarity score of simulation and experimental data from 0.18 to 0.9 and from 0.18 to 0.41, respectively. Furthermore, the quality of the reconstructed images with filtered back projection followed by our denoising approach exceeds that of reconstructions with the simultaneous iterative reconstruction technique, showing the computational superiority of our approach.
Submitted 30 December, 2019; v1 submitted 20 February, 2019;
originally announced February 2019.