Showing 1–36 of 36 results for author: Wahib, M

  1. arXiv:2507.04647  [pdf, ps, other]

    cs.DC math.NA

    RAPTOR: Practical Numerical Profiling of Scientific Applications

    Authors: Faveo Hoerold, Ivan R. Ivanov, Akash Dhruv, William S. Moses, Anshu Dubey, Mohamed Wahib, Jens Domke

    Abstract: The proliferation of low-precision units in modern high-performance architectures increasingly burdens domain scientists. Historically, the choice in HPC was easy: can we get away with 32 bit floating-point operations and lower bandwidth requirements, or is FP64 necessary? Driven by Artificial Intelligence, vendors introduced novel low-precision units for vector and tensor operations, and FP64 cap…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: 12 pages, 8 figures, to be published in SC'25

  2. arXiv:2506.21411  [pdf, ps, other]

    cs.LG

    Distributed Cross-Channel Hierarchical Aggregation for Foundation Models

    Authors: Aristeidis Tsaris, Isaac Lyngaas, John Lagregren, Mohamed Wahib, Larry York, Prasanna Balaprakash, Dan Lu, Feiyi Wang, Xiao Wang

    Abstract: Vision-based scientific foundation models hold significant promise for advancing scientific discovery and innovation. This potential stems from their ability to aggregate images from diverse sources such as varying physical groundings or data acquisition systems and to learn spatio-temporal correlations using transformer architectures. However, tokenizing and aggregating images can be compute-inte…

    Submitted 26 June, 2025; originally announced June 2025.

  3. arXiv:2505.14864  [pdf, ps, other]

    cs.DC cs.AI

    Balanced and Elastic End-to-end Training of Dynamic LLMs

    Authors: Mohamed Wahib, Muhammed Abdullah Soyturk, Didem Unat

    Abstract: To reduce computational and memory costs in Large Language Models (LLMs), dynamic workload reduction schemes like Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs) have emerged. However, these methods introduce severe workload imbalances, limiting their practicality for large-scale distributed training. We propose DynMo,…

    Submitted 20 May, 2025; originally announced May 2025.

  4. arXiv:2505.13955  [pdf, other]

    cs.DC

    Paradigm Shift in Infrastructure Inspection Technology: Leveraging High-performance Imaging and Advanced AI Analytics to Inspect Road Infrastructure

    Authors: Du Wu, Enzhi Zhang, Isaac Lyngaas, Xiao Wang, Amir Ziabari, Tao Luo, Peng Chen, Kento Sato, Fumiyoshi Shoji, Takaki Hatsui, Kentaro Uesugi, Akira Seo, Yasuhito Sakai, Toshio Endo, Tetsuya Ishikawa, Satoshi Matsuoka, Mohamed Wahib

    Abstract: Effective road infrastructure management is crucial for modern society. Traditional manual inspection techniques remain constrained by cost, efficiency, and scalability, while camera and laser imaging methods fail to capture subsurface defects critical for long-term structural integrity. This paper introduces ROVAI, an end-to-end framework that integrates high-resolution X-ray computed tomography…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Submitting this work to be considered for the Gordon Bell Award in SC25

  5. arXiv:2505.12621  [pdf, ps, other]

    cs.CL cs.IR

    Think Before You Attribute: Improving the Performance of LLMs Attribution Systems

    Authors: João Eduardo Batista, Emil Vatai, Mohamed Wahib

    Abstract: Large Language Models (LLMs) are increasingly applied in various science domains, yet their broader adoption remains constrained by a critical challenge: the lack of trustworthy, verifiable outputs. Current LLMs often generate answers without reliable source attribution, or worse, with incorrect attributions, posing a barrier to their use in scientific and high-stakes settings, where traceability…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: 22 pages (9 pages of content, 4 pages of references, 9 pages of supplementary material), 7 figures, 10 tables

  6. arXiv:2505.04802  [pdf, other]

    cs.LG astro-ph.EP cs.AI cs.DC physics.ao-ph

    ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling

    Authors: Xiao Wang, Jong-Youl Choi, Takuya Kurihaya, Isaac Lyngaas, Hong-Jun Yoon, Ming Fan, Nasik Muhammad Nafi, Aristeidis Tsaris, Ashwin M. Aji, Maliha Hossain, Mohamed Wahib, Dali Wang, Peter Thornton, Prasanna Balaprakash, Moetasim Ashfaq, Dan Lu

    Abstract: Sparse observations and coarse-resolution climate models limit effective regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, hyper-reso…

    Submitted 7 May, 2025; originally announced May 2025.

  7. A Unifying Framework to Enable Artificial Intelligence in High Performance Computing Workflows

    Authors: Jens Domke, Mohamed Wahib, Anshu Dubey, Tal Ben-Nun, Erik W. Draeger

    Abstract: Current trends point to a future where large-scale scientific applications are tightly-coupled HPC/AI hybrids. Hence, we urgently need to invest in creating a seamless, scalable framework where HPC and AI/ML can efficiently work together and adapt to novel hardware and vendor libraries without starting from scratch every few years. The current ecosystem and sparsely-connected community are not suf…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: Article is still in press; DOI has already been assigned by the publisher; publication will appear in Computing in Science & Engineering (CiSE) https://www.computer.org/csdl/magazine/cs

  8. arXiv:2502.16851  [pdf, other]

    cs.DC cs.PF

    Can Tensor Cores Benefit Memory-Bound Kernels? (No!)

    Authors: Lingqi Zhang, Jiajun Huang, Sheng Di, Satoshi Matsuoka, Mohamed Wahib

    Abstract: Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success, researchers have attempted to extend tensor core capabilities beyond dense matrix computations to other computational patterns, including memory-bound kernels. Recent…

    Submitted 27 February, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

  9. Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers

    Authors: Chen Zhuang, Lingqi Zhang, Du Wu, Peng Chen, Jiajun Huang, Xin Liu, Rio Yokota, Nikoli Dryden, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib

    Abstract: Graph Convolutional Networks (GCNs), particularly for large-scale graphs, are crucial across numerous domains. However, training distributed full-batch GCNs on large-scale graphs suffers from inefficient memory access patterns and high communication overhead. To address these challenges, we introduce \method{}, an efficient and scalable distributed GCN training framework tailored for CPU-powered s…

    Submitted 26 May, 2025; v1 submitted 24 November, 2024; originally announced November 2024.

  10. arXiv:2410.03210  [pdf, other]

    cs.LG

    Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness

    Authors: Emil Vatai, Aleksandr Drozd, Ivan R. Ivanov, Joao E. Batista, Yinghao Ren, Mohamed Wahib

    Abstract: Frameworks and domain-specific languages for auto-generating code have traditionally depended on human experts to implement rigorous methods ensuring the legality of code transformations. Recently, machine learning (ML) has gained traction for generating code optimized for specific hardware targets. However, ML approaches-particularly black-box neural networks-offer no guarantees on the correctnes…

    Submitted 2 June, 2025; v1 submitted 4 October, 2024; originally announced October 2024.

    Comments: Submitted to SC25

  11. arXiv:2407.15600  [pdf, other]

    cs.NE cs.AI

    A Pairwise Comparison Relation-assisted Multi-objective Evolutionary Neural Architecture Search Method with Multi-population Mechanism

    Authors: Yu Xue, Chenchen Zhu, MengChu Zhou, Mohamed Wahib, Moncef Gabbouj

    Abstract: Neural architecture search (NAS) enables researchers to automatically explore vast search spaces and find efficient neural networks. But NAS suffers from a key bottleneck, i.e., numerous architectures need to be evaluated during the search process, which requires a lot of computing resources and time. In order to improve the efficiency of NAS, a series of methods have been proposed to reduce the…

    Submitted 22 July, 2024; originally announced July 2024.

  12. arXiv:2405.15780  [pdf, other]

    cs.CV cs.LG

    Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier

    Authors: Aristeidis Tsaris, Chengming Zhang, Xiao Wang, Junqi Yin, Siyan Liu, Moetasim Ashfaq, Ming Fan, Jong Youl Choi, Mohamed Wahib, Dan Lu, Prasanna Balaprakash, Feiyi Wang

    Abstract: Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While transformers for text have inspired scaling sequence lengths in ViTs, adapting these for ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to…

    Submitted 17 April, 2024; originally announced May 2024.

  13. arXiv:2404.09707  [pdf, other]

    cs.CV cs.AI cs.LG

    Adaptive Patching for High-resolution Image Segmentation with Transformers

    Authors: Enzhi Zhang, Isaac Lyngaas, Peng Chen, Xiao Wang, Jun Igarashi, Yuankai Huo, Mohamed Wahib, Masaharu Munetomo

    Abstract: Attention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g. microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attenti…

    Submitted 15 April, 2024; originally announced April 2024.

  14. arXiv:2401.03378  [pdf, other]

    cs.DC math.NA

    CG-Kit: Code Generation Toolkit for Performant and Maintainable Variants of Source Code Applied to Flash-X Hydrodynamics Simulations

    Authors: Johann Rudi, Youngjun Lee, Aidan H. Chadha, Mohamed Wahib, Klaus Weide, Jared P. O'Neal, Anshu Dubey

    Abstract: CG-Kit is a new code generation toolkit that we propose as a solution for portability and maintainability for scientific computing applications. The development of CG-Kit is rooted in the urgent need created by the shifting landscape of high-performance computing platforms and the algorithmic complexities of a particular large-scale multiphysics application: Flash-X. This combination leads to uniq…

    Submitted 6 January, 2024; originally announced January 2024.

    Comments: submitted

  15. arXiv:2311.02382  [pdf, other]

    cs.DC cs.AI

    Ultra-Long Sequence Distributed Transformer

    Authors: Xiao Wang, Isaac Lyngaas, Aristeidis Tsaris, Peng Chen, Sajal Dash, Mayanka Chandra Shekar, Tao Luo, Hong-Jun Yoon, Mohamed Wahib, John Gouley

    Abstract: Transformer models trained on long sequences often achieve higher accuracy than short sequences. Unfortunately, conventional transformers struggle with long sequence training due to the overwhelming computation and memory requirements. Existing methods for long sequence training offer limited speedup and memory reduction, and may compromise accuracy. This paper presents a novel and efficient distr…

    Submitted 8 November, 2023; v1 submitted 4 November, 2023; originally announced November 2023.

  16. arXiv:2310.10102  [pdf, other]

    cs.DC cs.CV cs.LG

    KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training

    Authors: Truong Thao Nguyen, Balazs Gerofi, Edgar Josafat Martinez-Noriega, François Trahay, Mohamed Wahib

    Abstract: This paper proposes a method for hiding the least-important samples during the training of deep neural networks to increase efficiency, i.e., to reduce the cost of training. Using information about the loss and prediction confidence during training, we adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process, without significantly degrading ac…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: Advances in Neural Information Processing Systems 2023 (NeurIPS 2023)

  17. Exploiting Scratchpad Memory for Deep Temporal Blocking: A case study for 2D Jacobian 5-point iterative stencil kernel (j2d5pt)

    Authors: Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, Satoshi Matsuoka

    Abstract: General Purpose Graphics Processing Units (GPGPU) are used in most of the top systems in HPC. The total capacity of scratchpad memory has increased by more than 40 times in the last decade. However, existing optimizations for stencil computations using temporal blocking have not aggressively exploited the large capacity of scratchpad memory. This work uses the 2D Jacobian 5-point iterative stencil…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: This short paper is published in the 15th Workshop on General Purpose Processing Using GPU (GPGPU 2023)

  18. Revisiting Temporal Blocking Stencil Optimizations

    Authors: Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, Satoshi Matsuoka

    Abstract: Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving t…

    Submitted 12 May, 2023; originally announced May 2023.

    Comments: This paper will be published in 2023 International Conference on Supercomputing (ICS23)

  19. arXiv:2301.02432  [pdf, other]

    cs.DC cs.AR cs.CY cs.LG cs.SI

    Myths and Legends in High-Performance Computing

    Authors: Satoshi Matsuoka, Jens Domke, Mohamed Wahib, Aleksandr Drozd, Torsten Hoefler

    Abstract: In this thought-provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We gathered these myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within and beyond our community. We believe they represent the zeitgeist of the c…

    Submitted 24 October, 2023; v1 submitted 6 January, 2023; originally announced January 2023.

  20. arXiv:2208.11630  [pdf, other]

    physics.comp-ph astro-ph.IM cs.MS

    Flash-X, a multiphysics simulation software instrument

    Authors: Anshu Dubey, Klaus Weide, Jared O'Neal, Akash Dhruv, Sean Couch, J. Austin Harris, Tom Klosterman, Rajeev Jain, Johann Rudi, Bronson Messer, Michael Pajkos, Jared Carlson, Ran Chu, Mohamed Wahib, Saurabh Chawdhary, Paul M. Ricker, Dongwook Lee, Katie Antypas, Katherine M. Riley, Christopher Daley, Murali Ganapathy, Francis X. Timmes, Dean M. Townsley, Marcos Vanella, John Bachan, et al. (6 additional authors not shown)

    Abstract: Flash-X is a highly composable multiphysics software system that can be used to simulate physical phenomena in several scientific domains. It derives some of its solvers from FLASH, which was first released in 2000. Flash-X has a new framework that relies on abstractions and asynchronous communications for performance portability across a range of increasingly heterogeneous hardware platforms. Fla…

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: 16 pages, 5 Figures, published open access in SoftwareX

    Journal ref: SoftwareX, Volume 19, 2022, 101168, ISSN 2352-7110

  21. Image Gradient Decomposition for Parallel and Memory-Efficient Ptychographic Reconstruction

    Authors: Xiao Wang, Aristeidis Tsaris, Debangshu Mukherjee, Mohamed Wahib, Peng Chen, Mark Oxley, Olga Ovchinnikova, Jacob Hinkle

    Abstract: Ptychography is a popular microscopic imaging modality for many scientific discoveries and sets the record for highest image resolution. Unfortunately, the high image resolution for ptychographic reconstruction requires significant amount of memory and computations, forcing many applications to compromise their image resolution in exchange for a smaller memory footprint and a shorter reconstructio…

    Submitted 16 December, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Journal ref: Proceedings of the SC 22. IEEE Press, Article 8, 1-13 (2022)

  22. arXiv:2204.07336  [pdf, ps, other]

    cs.DC

    Preparing for the Future -- Rethinking Proxy Apps

    Authors: Satoshi Matsuoka, Jens Domke, Mohamed Wahib, Aleksandr Drozd, Ray Bair, Andrew A. Chien, Jeffrey S. Vetter, John Shalf

    Abstract: A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN's Supercomputer Fugaku, ANL's Aurora, or ORNL's Frontier. This process was necessary to standardize the procurement while avoiding duplicated effort at each HPC center t…

    Submitted 15 April, 2022; originally announced April 2022.

  23. arXiv:2204.02235  [pdf, other]

    cs.DC

    At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads

    Authors: Jens Domke, Emil Vatai, Balazs Gerofi, Yuetsu Kodama, Mohamed Wahib, Artur Podobas, Sparsh Mittal, Miquel Pericàs, Lingqi Zhang, Peng Chen, Aleksandr Drozd, Satoshi Matsuoka

    Abstract: Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method…

    Submitted 16 October, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

  24. PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications

    Authors: Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, Satoshi Matsuoka

    Abstract: Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (…

    Submitted 12 May, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: This paper will be published in 2023 International Conference on Supercomputing (ICS23)

  25. arXiv:2110.11466  [pdf, other]

    cs.LG cs.DC

    MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

    Authors: Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi Honda, et al. (18 additional authors not shown)

    Abstract: Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning appli…

    Submitted 26 October, 2021; v1 submitted 21 October, 2021; originally announced October 2021.

  26. Performance Portable Back-projection Algorithms on CPUs: Agnostic Data Locality and Vectorization Optimizations

    Authors: Peng Chen, Mohamed Wahib, Xiao Wang, Shinichiro Takizawa, Takahiro Hirofuchi, Hirotaka Ogawa, Satoshi Matsuoka

    Abstract: Computed Tomography (CT) is a key 3D imaging technology that fundamentally relies on the compute-intense back-projection operation to generate 3D volumes. GPUs are typically used for back-projection in production CT devices. However, with the rise of power-constrained micro-CT devices, and also the emergence of CPUs comparable in performance to GPUs, back-projection for CPUs could become favorable…

    Submitted 27 April, 2021; originally announced April 2021.

    Comments: ACM International Conference on Supercomputing 2021 (ICS'21)

  27. An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

    Authors: Albert Njoroge Kahira, Truong Thao Nguyen, Leonardo Bautista Gomez, Ryousei Takano, Rosa M Badia, Mohamed Wahib

    Abstract: Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and alleviate memory capacity limitations when training large models and/or using high dimension inputs. With the steady increase in datasets and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communicat…

    Submitted 19 April, 2021; originally announced April 2021.

    Comments: The International ACM Symposium on High-Performance Parallel and Distributed Computing 2021 (HPDC'21)

  28. arXiv:2010.14373  [pdf, other]

    cs.DC

    Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?

    Authors: Jens Domke, Emil Vatai, Aleksandr Drozd, Peng Chen, Yosuke Oyama, Lingqi Zhang, Shweta Salaria, Daichi Mukunoki, Artur Podobas, Mohamed Wahib, Satoshi Matsuoka

    Abstract: Matrix engines or units, in different forms and affinities, are becoming a reality in modern processors; CPUs and otherwise. The current and dominant algorithmic approach to Deep Learning merits the commercial investments in these units, and deduced from the No.1 benchmark in supercomputing, namely High Performance Linpack, one would expect an awakened enthusiasm by the HPC community, too. Hence…

    Submitted 27 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: IEEE International Parallel and Distributed Processing Symposium 2021 (IPDPS'21)

  29. GTOPX Space Mission Benchmarks

    Authors: Martin Schlueter, Mehdi Neshat, Mohamed Wahib, Masaharu Munetomo, Markus Wagner

    Abstract: This contribution introduces the GTOPX space mission benchmark collection, which is an extension of GTOP database published by the European Space Agency (ESA). GTOPX consists of ten individual benchmark instances representing real-world interplanetary space trajectory design problems. In regard to the original GTOP collection, GTOPX includes three new problem instances featuring mixed-integer and…

    Submitted 17 February, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

  30. arXiv:2008.11421  [pdf, other]

    cs.DC cs.LG

    Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

    Authors: Mohamed Wahib, Haoyu Zhang, Truong Thao Nguyen, Aleksandr Drozd, Jens Domke, Lingqi Zhang, Ryousei Takano, Satoshi Matsuoka

    Abstract: The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in additi…

    Submitted 26 August, 2020; originally announced August 2020.

    Comments: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'20)

  31. A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs

    Authors: Fareed Qararyah, Mohamed Wahib, Doğa Dikbayır, Mehmet Esat Belviranli, Didem Unat

    Abstract: Many state-of-the-art Deep Neural Networks (DNNs) have substantial memory requirements. Limited device memory becomes a bottleneck when training those models. We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs. ParDNN decides a placement of DNN's underlying computational graph operations across multiple devices so…

    Submitted 5 May, 2021; v1 submitted 19 August, 2020; originally announced August 2020.

  32. arXiv:2004.05371  [pdf, other]

    cs.DC

    A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

    Authors: Lingqi Zhang, Mohamed Wahib, Haoyu Zhang, Satoshi Matsuoka

    Abstract: GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronizations at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia's latest CUDA provides a variety of synchronization methods. Until now, there is no full understanding of the characteristics of thos…

    Submitted 11 April, 2020; originally announced April 2020.

    Comments: IPDPS20

    Journal ref: IEEE International Parallel & Distributed Processing Symposium 2020

  33. AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

    Authors: Kazuaki Matsumura, Hamid Reza Zohouri, Mohamed Wahib, Toshio Endo, Satoshi Matsuoka

    Abstract: Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architect…

    Submitted 3 February, 2020; v1 submitted 6 January, 2020; originally announced January 2020.

  34. iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction

    Authors: Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka

    Abstract: Computed Tomography (CT) is a widely used technology that requires compute-intense algorithms for image reconstruction. We propose a novel back-projection algorithm that reduces the projection computation cost to 1/6 of the standard algorithm. We also propose an efficient implementation that takes advantage of the heterogeneity of GPU-accelerated systems by overlapping the filtering and back-proje…

    Submitted 6 September, 2019; originally announced September 2019.

    Comments: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19)

  35. A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

    Authors: Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka

    Abstract: This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versa…

    Submitted 6 September, 2019; v1 submitted 13 July, 2019; originally announced July 2019.

    Comments: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19)

  36. arXiv:1810.09330  [pdf, ps, other]

    cs.DC

    Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?

    Authors: Jens Domke, Kazuaki Matsumura, Mohamed Wahib, Haoyu Zhang, Keita Yashima, Toshiki Tsuchikawa, Yohei Tsuji, Artur Podobas, Satoshi Matsuoka

    Abstract: Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this view. In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC…

    Submitted 25 March, 2019; v1 submitted 22 October, 2018; originally announced October 2018.

    Comments: IEEE International Parallel and Distributed Processing Symposium 2019