
Showing 1–39 of 39 results for author: Keyes, D

Searching in archive cs.
  1. arXiv:2506.11164  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Synthetic Geology -- Structural Geology Meets Deep Learning

    Authors: Simon Ghyselincks, Valeriia Okhmak, Stefano Zampini, George Turkiyyah, David Keyes, Eldad Haber

    Abstract: Visualizing the first few kilometers of the Earth's subsurface, a long-standing challenge gating a virtually inexhaustible list of important applications, is coming within reach through deep learning. Building on techniques of generative artificial intelligence applied to voxelated images, we demonstrate a method that extends surface geological data supplemented by boreholes to a three-dimensional…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 10 pages, 8 figures, submitted to "Communications Earth & Environment", geological simulation code at https://doi.org/10.5281/zenodo.15244035, generative AI code at https://github.com/chipnbits/flowtrain_stochastic_interpolation/releases/tag/v1.0.0

  2. arXiv:2505.06896  [pdf, ps, other]

    cs.DC stat.CO

    RCOMPSs: A Scalable Runtime System for R Code Execution on Manycore Systems

    Authors: Xiran Zhang, Javier Conejero, Sameh Abdulah, Jorge Ejarque, Ying Sun, Rosa M. Badia, David E. Keyes, Marc G. Genton

    Abstract: R has become a cornerstone of scientific and statistical computing due to its extensive package ecosystem, expressive syntax, and strong support for reproducible analysis. However, as data sizes and computational demands grow, native R parallelism support remains limited. This paper presents RCOMPSs, a scalable runtime system that enables efficient parallel execution of R applications on multicore…

    Submitted 11 May, 2025; originally announced May 2025.

  3. arXiv:2504.19171  [pdf, other]

    cs.PF stat.CO

    GPU-Accelerated Parallel Selected Inversion for Structured Matrices Using sTiles

    Authors: Esmail Abdul Fattah, Hatem Ltaief, Havard Rue, David Keyes

    Abstract: Selected inversion is essential for applications such as Bayesian inference, electronic structure calculations, and inverse covariance estimation, where computing only specific elements of large sparse matrix inverses significantly reduces computational and memory overhead. We present an efficient implementation of a two-phase parallel algorithm for computing selected elements of the inverse of a…

    Submitted 27 April, 2025; originally announced April 2025.

  4. arXiv:2504.12004  [pdf, other]

    cs.DC

    Scaled Block Vecchia Approximation for High-Dimensional Gaussian Process Emulation on GPUs

    Authors: Qilong Pan, Sameh Abdulah, Mustafa Abduljabbar, Hatem Ltaief, Andreas Herten, Mathis Bode, Matthew Pratola, Arindam Fadikar, Marc G. Genton, David E. Keyes, Ying Sun

    Abstract: Emulating computationally intensive scientific simulations is essential to enable uncertainty quantification, optimization, and decision-making at scale. Gaussian Processes (GPs) offer a flexible and data-efficient foundation for statistical emulation, but their poor scalability limits applicability to large datasets. We introduce the Scaled Block Vecchia (SBV) algorithm for distributed GPU-based…

    Submitted 16 April, 2025; originally announced April 2025.

  5. arXiv:2503.12668  [pdf, other]

    cs.LG cs.PF

    ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

    Authors: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang

    Abstract: Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, el…

    Submitted 16 March, 2025; originally announced March 2025.

    Comments: 14 pages, 7 figures

  6. arXiv:2502.00356  [pdf, other]

    cs.DC

    GPU-Accelerated Modified Bessel Function of the Second Kind for Gaussian Processes

    Authors: Zipei Geng, Sameh Abdulah, Ying Sun, Hatem Ltaief, David E. Keyes, Marc G. Genton

    Abstract: Modified Bessel functions of the second kind are widely used in physics, engineering, spatial statistics, and machine learning. Since contemporary scientific applications, including machine learning, rely on GPUs for acceleration, providing robust GPU-hosted implementations of special functions, such as the modified Bessel function, is crucial for performance. Existing implementations of the modif…

    Submitted 5 April, 2025; v1 submitted 1 February, 2025; originally announced February 2025.

  7. arXiv:2501.02483  [pdf, other]

    cs.PF math.NA

    sTiles: An Accelerated Computational Framework for Sparse Factorizations of Structured Matrices

    Authors: Esmail Abdul Fattah, Hatem Ltaief, Havard Rue, David Keyes

    Abstract: This paper introduces sTiles, a GPU-accelerated framework for factorizing sparse structured symmetric matrices. By leveraging tile algorithms for fine-grained computations, sTiles uses a structure-aware task execution flow to handle challenging arrowhead sparse matrices with variable bandwidths, common in scientific and engineering fields. It minimizes fill-in during Cholesky factorization using p…

    Submitted 5 January, 2025; originally announced January 2025.

    Comments: 13 pages, 14 figures

  8. arXiv:2410.19460  [pdf, other]

    cs.LG cs.AI cs.PF math.NA

    Accelerating AI Performance using Anderson Extrapolation on GPUs

    Authors: Saleem Abdul Fattah Ahmed Al Dajani, David E. Keyes

    Abstract: We present a novel approach for accelerating AI performance by leveraging Anderson extrapolation, a vector-to-vector mapping technique based on a window of historical iterations. By identifying the crossover point (Fig. 1) where a mixing penalty is incurred, the method focuses on reducing iterations to convergence, with fewer, more compute-intensive but generally cacheable iterations, balancing spe…

    Submitted 18 December, 2024; v1 submitted 25 October, 2024; originally announced October 2024.

    Comments: 6 pages, 6 figures, 1 table, Accepted by NeurIPS 2024 Workshop MLNCP https://openreview.net/forum?id=wkP2ZFRn9e

    Journal ref: Neural Information Processing Systems (NeurIPS). Machine Learning with New Compute Paradigms (MLNCP) Workshop, October 2024

  9. arXiv:2410.09819  [pdf, other]

    cs.DC

    Accelerating Mixed-Precision Out-of-Core Cholesky Factorization with Static Task Scheduling

    Authors: Jie Ren, Hatem Ltaief, Sameh Abdulah, David E. Keyes

    Abstract: This paper explores the performance optimization of out-of-core (OOC) Cholesky factorization on shared-memory systems equipped with multiple GPUs. We employ fine-grained computational tasks to expose concurrency while creating opportunities to overlap data movement asynchronously with computations, especially when dealing with matrices that cannot fit on the GPU memory. We leverage the directed ac…

    Submitted 13 October, 2024; originally announced October 2024.

  10. arXiv:2409.01712  [pdf, other]

    q-bio.GN cs.AR cs.LG cs.MS cs.PF

    Toward Capturing Genetic Epistasis From Multivariate Genome-Wide Association Studies Using Mixed-Precision Kernel Ridge Regression

    Authors: Hatem Ltaief, Rabab Alomairy, Qinglei Cao, Jie Ren, Lotfi Slim, Thorsten Kurth, Benedikt Dorschner, Salim Bougouffa, Rached Abdelkhalak, David E. Keyes

    Abstract: We exploit the widening margin in tensor-core performance between [FP64/FP32/FP16/INT8,FP64/FP32/FP16/FP8/INT8] on NVIDIA [Ampere,Hopper] GPUs to boost the performance of output accuracy-preserving mixed-precision computation of Genome-Wide Association Studies (GWAS) of 305K patients from the UK BioBank, the largest-ever GWAS cohort studied for genetic epistasis using a multivariate approach. Tile…

    Submitted 3 September, 2024; originally announced September 2024.

  11. arXiv:2407.19724  [pdf, other]

    cs.LG physics.app-ph

    Constructing artificial life and materials scientists with accelerated AI using Deep AndersoNN

    Authors: Saleem Abdul Fattah Ahmed Al Dajani, David Keyes

    Abstract: Deep AndersoNN accelerates AI by exploiting the continuum limit as the number of explicit layers in a neural network approaches infinity and can be taken as a single implicit layer, known as a deep equilibrium model. Solving for deep equilibrium model parameters reduces to a nonlinear fixed point iteration problem, enabling the use of vector-to-vector iterative solvers and windowing techniques, su…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: 7 pages, 5 figures, 2 tables, Accepted by ICML ML4LMS https://openreview.net/forum?id=qhwyvhqAvI . International Conference on Machine Learning (ICML). Machine Learning for Life and Material Science (ML4LMS) Workshop, May 2024

    Journal ref: International Conference on Machine Learning (ICML). Machine Learning for Life and Material Science (ML4LMS) Workshop, May 2024

  12. arXiv:2405.14892  [pdf, other]

    cs.DC stat.CO

    Parallel Approximations for High-Dimensional Multivariate Normal Probability Computation in Confidence Region Detection Applications

    Authors: Xiran Zhang, Sameh Abdulah, Jian Cao, Hatem Ltaief, Ying Sun, Marc G. Genton, David E. Keyes

    Abstract: Addressing the statistical challenge of computing the multivariate normal (MVN) probability in high dimensions holds significant potential for enhancing various applications. One common way to compute high-dimensional MVN probabilities is the Separation-of-Variables (SOV) algorithm. This algorithm is known for its high computational complexity of O(n^3) and space complexity of O(n^2), mainly due t…

    Submitted 18 May, 2024; originally announced May 2024.

  13. arXiv:2403.12188  [pdf, other]

    cs.LG cs.MS math.OC

    PETScML: Second-order solvers for training regression problems in Scientific Machine Learning

    Authors: Stefano Zampini, Umberto Zerbinati, George Turkiyyah, David Keyes

    Abstract: In recent years, we have witnessed the emergence of scientific machine learning as a data-driven tool for the analysis, by means of deep-learning techniques, of data produced by computational science and engineering applications. At the core of these methods is the supervised training algorithm to learn the neural network realization, a highly non-convex optimization problem that is usually solved…

    Submitted 18 March, 2024; originally announced March 2024.

    MSC Class: 65K10; 68T07; 65M70; 65Y05 ACM Class: I.2.5; D.2.m; G.4; G.1.6; J.2

  14. arXiv:2403.07412  [pdf, other]

    stat.CO cs.DC

    GPU-Accelerated Vecchia Approximations of Gaussian Processes for Geospatial Data using Batched Matrix Computations

    Authors: Qilong Pan, Sameh Abdulah, Marc G. Genton, David E. Keyes, Hatem Ltaief, Ying Sun

    Abstract: Gaussian processes (GPs) are commonly used for geospatial analysis, but they suffer from high computational complexity when dealing with massive data. For instance, the log-likelihood function required in estimating the statistical model parameters for geospatial data is a computationally intensive procedure that involves computing the inverse of a covariance matrix with size n X n, where n repres…

    Submitted 3 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  15. arXiv:2312.07748  [pdf, other]

    cs.DC

    Portability and Scalability Evaluation of Large-Scale Statistical Modeling and Prediction Software through HPC-Ready Containers

    Authors: Sameh Abdulah, Jorge Ejarque, Omar Marzouk, Hatem Ltaief, Ying Sun, Marc G. Genton, Rosa M. Badia, David E. Keyes

    Abstract: HPC-based applications often have complex workflows with many software dependencies that hinder their portability on contemporary HPC architectures. In addition, these applications often require extraordinary efforts to deploy and execute at performance potential on new HPC systems, while the users expert in these applications generally have less expertise in HPC and related technologies. This pap…

    Submitted 4 December, 2023; originally announced December 2023.

  16. arXiv:2109.05451  [pdf, other]

    cs.DC cs.MS

    H2Opus: A distributed-memory multi-GPU software package for non-local operators

    Authors: Stefano Zampini, Wajih Boukaram, George Turkiyyah, Omar Knio, David E. Keyes

    Abstract: Hierarchical $\mathcal{H}^2$-matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their $O(N)$ complexity in both memory and operator application makes them particularly suited for large-scale problems. As a result, there is a need for software that provides support for distributed o…

    Submitted 12 September, 2021; originally announced September 2021.

    MSC Class: 65Y05; 65F55; 65R20; 65-04 ACM Class: G.4; G.1.9

  17. arXiv:2108.11932  [pdf, other]

    cs.DC cs.MS

    H2OPUS-TLR: High Performance Tile Low Rank Symmetric Factorizations using Adaptive Randomized Approximation

    Authors: Wajih Boukaram, Stefano Zampini, George Turkiyyah, David Keyes

    Abstract: Tile low rank representations of dense matrices partition them into blocks of roughly uniform size, where each off-diagonal tile is compressed and stored as its own low rank factorization. They offer an attractive representation for many data-sparse dense operators that appear in practical applications, where substantial compression and a much smaller memory footprint can be achieved. TLR matrices…

    Submitted 26 August, 2021; originally announced August 2021.

    MSC Class: 65F05; 65F08; 65F55 ACM Class: G.4

  18. arXiv:2104.14186  [pdf, other]

    math.NA cs.DC cs.MS

    High-Performance Partial Spectrum Computation for Symmetric eigenvalue problems and the SVD

    Authors: D. Keyes, H. Ltaief, Y. Nakatsukasa, D. Sukkari

    Abstract: Current dense symmetric eigenvalue (EIG) and singular value decomposition (SVD) implementations may suffer from the lack of concurrency during the tridiagonal and bidiagonal reductions, respectively. This performance bottleneck is typical for the two-sided transformations due to the Level-2 BLAS memory-bound calls. Therefore, the current state-of-the-art EIG and SVD implementations may achieve onl…

    Submitted 29 April, 2021; originally announced April 2021.

  19. arXiv:2008.07437  [pdf, other]

    cs.DC

    High Performance Multivariate Geospatial Statistics on Manycore Systems

    Authors: Mary Lai O. Salvaña, Sameh Abdulah, Huang Huang, Hatem Ltaief, Ying Sun, Marc G. Genton, David E. Keyes

    Abstract: Modeling and inferring spatial relationships and predicting missing values of environmental data are some of the main tasks of geospatial statisticians. These routine tasks are accomplished using multivariate geospatial models and the cokriging technique. The latter requires the evaluation of the expensive Gaussian log-likelihood function, which has impeded the adoption of multivariate geospatial…

    Submitted 4 April, 2021; v1 submitted 3 August, 2020; originally announced August 2020.

  20. arXiv:2003.05324  [pdf, other]

    cs.DC

    Geostatistical Modeling and Prediction Using Mixed-Precision Tile Cholesky Factorization

    Authors: Sameh Abdulah, Hatem Ltaief, Ying Sun, Marc G. Genton, David E. Keyes

    Abstract: Geostatistics represents one of the most challenging classes of scientific applications due to the desire to incorporate an ever increasing number of geospatial locations to accurately model and predict environmental phenomena. For example, the evaluation of the Gaussian log-likelihood function, which constitutes the main computational phase, involves solving systems of linear equations with a lar…

    Submitted 8 January, 2020; originally announced March 2020.

  21. arXiv:2002.09561  [pdf, other]

    cs.DC eess.SP

    Performance / Complexity Trade-offs of the Sphere Decoder Algorithm for Massive MIMO Systems

    Authors: A. Dabah, H. Ltaief, Z. Rezki, M. -A. Arfaoui, M. -S. Alouini, D. Keyes

    Abstract: Massive MIMO systems are seen by many researchers as a paramount technology toward next generation networks. This technology consists of hundreds of antennas that are capable of sending and receiving simultaneously a huge amount of data. One of the main challenges when using this technology is the necessity of an efficient decoding framework. The latter must guarantee both a low complexity and a g…

    Submitted 21 February, 2020; originally announced February 2020.

  22. arXiv:1908.06936  [pdf, other]

    cs.DC stat.CO

    Large-scale Environmental Data Science with ExaGeoStatR

    Authors: Sameh Abdulah, Yuxiao Li, Jian Cao, Hatem Ltaief, David E. Keyes, Marc G. Genton, Ying Sun

    Abstract: Parallel computing in Gaussian process calculations becomes necessary for avoiding computational and memory restrictions associated with large-scale environmental data science applications. The evaluation of the Gaussian log-likelihood function requires O(n^2) storage and O(n^3) operations where n is the number of geographical locations. Thus, computing the log-likelihood function with a large num…

    Submitted 18 October, 2022; v1 submitted 23 July, 2019; originally announced August 2019.

  23. arXiv:1902.01829  [pdf, other]

    cs.DS cs.MS

    Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression

    Authors: Wajih Halim Boukaram, George Turkiyyah, David E. Keyes

    Abstract: Hierarchical matrices are space and time efficient representations of dense matrices that exploit the low rank structure of matrix blocks at different levels of granularity. The hierarchically low rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity of dense matrices. In this paper, we present high…

    Submitted 5 February, 2019; originally announced February 2019.

  24. arXiv:1804.09536  [pdf, other]

    cs.DC cs.MS

    Fast parallel multidimensional FFT using advanced MPI

    Authors: Lisandro Dalcin, Mikael Mortensen, David E Keyes

    Abstract: We present a new method for performing global redistributions of multidimensional arrays essential to parallel fast Fourier (or similar) transforms. Traditional methods use standard all-to-all collective communication of contiguous memory buffers, thus necessarily requiring local data realignment steps intermixed in-between redistribution and transform steps. Instead, our method takes advantage of s…

    Submitted 25 April, 2018; originally announced April 2018.

  25. arXiv:1803.09948  [pdf, other]

    cs.PF cs.CE cs.MS

    Extreme Scale FMM-Accelerated Boundary Integral Equation Solver for Wave Scattering

    Authors: Mustafa Abduljabbar, Mohammed Al Farhan, Noha Al-Harthi, Rui Chen, Rio Yokota, Hakan Bagci, David Keyes

    Abstract: Algorithmic and architecture-oriented optimizations are essential for achieving performance worthy of anticipated energy-austere exascale systems. In this paper, we present an extreme scale FMM-accelerated boundary integral equation solver for wave scattering, which uses FMM as a matrix-vector multiplication inside the GMRES iterative method. Our FMM Helmholtz kernels treat nontrivial singular and…

    Submitted 27 March, 2018; originally announced March 2018.

  26. ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems

    Authors: Sameh Abdulah, Hatem Ltaief, Ying Sun, Marc G. Genton, David E. Keyes

    Abstract: We present ExaGeoStat, a high performance framework for geospatial statistics in climate and environment modeling. In contrast to simulation based on partial differential equations derived from first-principles modeling, ExaGeoStat employs a statistical model based on the evaluation of the Gaussian log-likelihood function, which operates on a large dense covariance matrix. Generated by the paramet…

    Submitted 22 June, 2018; v1 submitted 9 August, 2017; originally announced August 2017.

    Comments: 14 pages, 7 figures

  27. arXiv:1707.05141  [pdf, other]

    cs.MS cs.DS math.NA

    Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression

    Authors: Wajih Halim Boukaram, George Turkiyyah, Hatem Ltaief, David E. Keyes

    Abstract: We present high performance implementations of the QR and the singular value decomposition of a batch of small matrices hosted on the GPU with applications in the compression of hierarchical matrices. The one-sided Jacobi algorithm is used for its simplicity and inherent parallelism as a building block for the SVD of low rank blocks using randomized methods. We implement multiple kernels based on…

    Submitted 13 July, 2017; originally announced July 2017.

  28. arXiv:1702.05459  [pdf, other]

    cs.DC

    Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions

    Authors: Mustafa Abduljabbar, George Markomanolis, Huda Ibeid, Rio Yokota, David Keyes

    Abstract: Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical $N$-Body algorithms like FMM. In the present work, we propose four independent strategies to improve partitioning and reduce communication. First of all, we show that the conventional wisdom of using space-filling curve partitioning may not work well for boundary integral problems, which…

    Submitted 17 February, 2017; originally announced February 2017.

    Comments: arXiv admin note: text overlap with arXiv:1405.7487

  29. arXiv:1610.02608  [pdf, other]

    cs.CE math.HO stat.OT

    Research and Education in Computational Science and Engineering

    Authors: Ulrich Rüde, Karen Willcox, Lois Curfman McInnes, Hans De Sterck, George Biros, Hans Bungartz, James Corones, Evin Cramer, James Crowley, Omar Ghattas, Max Gunzburger, Michael Hanke, Robert Harrison, Michael Heroux, Jan Hesthaven, Peter Jimack, Chris Johnson, Kirk E. Jordan, David E. Keyes, Rolf Krause, Vipin Kumar, Stefan Mayer, Juan Meza, Knut Martin Mørken, J. Tinsley Oden, et al. (8 additional authors not shown)

    Abstract: Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that…

    Submitted 31 December, 2017; v1 submitted 8 October, 2016; originally announced October 2016.

    Comments: Major revision, to appear in SIAM Review

    Report number: Argonne National Laboratory Preprint ANL/MCS-P6054-0916 MSC Class: 00A72; 62-07; 68U20; 68W01; 68W10; 97A99; 97M10; 97N80; 97R20; 97R30 ACM Class: G.0; G.4; I.6; J.0; J.2; J.3; J.4; J.6; J.7; K.3.2

  30. arXiv:1510.05218  [pdf, other]

    cs.CE cs.DC cs.PF

    Optimization of an electromagnetics code with multicore wavefront diamond blocking and multi-dimensional intra-tile parallelization

    Authors: Tareq M. Malas, Julian Hornich, Georg Hager, Hatem Ltaief, Christoph Pflaum, David E. Keyes

    Abstract: Understanding and optimizing the properties of solar cells is becoming a key issue in the search for alternatives to nuclear and fossil energy sources. A theoretical analysis via numerical simulations involves solving Maxwell's Equations in discretized form and typically requires substantial computing effort. We start from a hybrid-parallel (MPI+OpenMP) production code that implements the Time Har…

    Submitted 18 October, 2015; originally announced October 2015.

  31. arXiv:1510.04995  [pdf, other]

    cs.DC cs.PF

    Multi-dimensional intra-tile parallelization for memory-starved stencil computations

    Authors: Tareq Malas, Georg Hager, Hatem Ltaief, David Keyes

    Abstract: Optimizing the performance of stencil algorithms has been the subject of intense research over the last two decades. Since many stencil schemes have low arithmetic intensity, most optimizations focus on increasing the temporal data access locality, thus reducing the data traffic through the main memory interface with the ultimate goal of decoupling from this bottleneck. There are, however, only fe…

    Submitted 16 October, 2015; originally announced October 2015.

  32. arXiv:1505.07630  [pdf, other]

    cs.DC

    Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

    Authors: Amani AlOnazi, David Keyes, Alexey Lastovetsky, Vladimir Rychkov

    Abstract: Hardware-aware design and optimization is crucial in exploiting emerging architectures for PDE-based computational fluid dynamics applications. In this work, we study optimizations aimed at acceleration of OpenFOAM-based applications on emerging hybrid heterogeneous platforms. OpenFOAM uses MPI to provide parallel multi-processor functionality, which scales well on homogeneous systems but does not…

    Submitted 28 May, 2015; originally announced May 2015.

    Comments: Presented at ParCFD 2014, prepared for submission to Computer and Fluids. 12 pages, 9 figures, 2 tables

    MSC Class: 68W10 ACM Class: G.1.8; C.1.4; I.3.1; G.4; G.1.0; F.1.2; D.1.3; C.1.2

  33. arXiv:1410.5561  [pdf, other]

    cs.PF

    Towards energy efficiency and maximum computational intensity for stencil algorithms using wavefront diamond temporal blocking

    Authors: Tareq Malas, Georg Hager, Hatem Ltaief, David Keyes

    Abstract: We study the impact of tunable parameters on computational intensity (i.e., inverse code balance) and energy consumption of multicore-optimized wavefront diamond temporal blocking (MWD) applied to different stencil-based update schemes. MWD combines the concepts of diamond tiling and multicore-aware wavefront blocking in order to achieve lower cache size requirements than standard single-core wave…

    Submitted 21 October, 2014; originally announced October 2014.

  34. Multicore-optimized wavefront diamond blocking for optimizing stencil updates

    Authors: Tareq Malas, Georg Hager, Hatem Ltaief, Holger Stengel, Gerhard Wellein, David Keyes

    Abstract: The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, espec…

    Submitted 12 October, 2014; originally announced October 2014.

  35. arXiv:1410.1726  [pdf, other]

    cs.MS

    KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

    Authors: Ahmad Abdelfattah, David Keyes, Hatem Ltaief

    Abstract: KBLAS is a new open source high performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameter…

    Submitted 7 October, 2014; originally announced October 2014.

    Comments: Submitted to the ACM Transactions on Mathematical Software

  36. arXiv:1406.1974  [pdf, other]

    cs.DC math.NA

    Communication Complexity of the Fast Multipole Method and its Algebraic Variants

    Authors: Rio Yokota, George Turkiyyah, David Keyes

    Abstract: A combination of hierarchical tree-like data structures and data access patterns from fast multipole methods and hierarchical low-rank approximation of linear operators from H-matrix methods appears to form an algorithmic path forward for efficient implementation of many linear algebraic operations of scientific computing at the exascale. The combination provides asymptotically optimal computation…

    Submitted 8 June, 2014; originally announced June 2014.

    MSC Class: 70F10 ACM Class: D.1.2; D.1.3; G.1.0; G.1.2

  37. arXiv:1405.7487  [pdf, other]

    cs.DC

    Asynchronous Execution of the Fast Multipole Method Using Charm++

    Authors: Mustafa AbdulJabbar, Rio Yokota, David Keyes

    Abstract: Fast multipole methods (FMM) on distributed memory have traditionally used a bulk-synchronous model of communicating the local essential tree (LET) and overlapping it with computation of the local data. This could be perceived as an extreme case of data aggregation, where the whole LET is communicated at once. Charm++ allows a much finer control over the granularity of communication, and has a…

    Submitted 29 May, 2014; originally announced May 2014.

    MSC Class: 70F10 ACM Class: D.1.2; D.1.3; G.1.0; G.1.2

  38. arXiv:1405.6362  [pdf, other]

    cs.DC

    A Performance Model for the Communication in Fast Multipole Methods on HPC Platforms

    Authors: Huda Ibeid, Rio Yokota, David Keyes

    Abstract: Exascale systems are predicted to have approximately one billion cores, assuming Gigahertz cores. Limitations on affordable network topologies for distributed memory systems of such massive scale bring new challenges to the current parallel programming model. Currently, there are many efforts to evaluate the hardware and software bottlenecks of exascale designs. There is therefore an urgent need to…

    Submitted 25 May, 2014; originally announced May 2014.

    MSC Class: 70F10 ACM Class: D.1.2; D.1.3; G.1.0; G.1.2

  39. Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Processor

    Authors: Tareq M. Malas, Aron J. Ahmadia, Jed Brown, John A. Gunnels, David E. Keyes

    Abstract: Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required t…

    Submitted 17 January, 2012; originally announced January 2012.