-
Supporting 64-bit global indices in Epetra and other Trilinos packages -- Techniques used and lessons learned
Authors:
Chetan Jhurani,
Travis M. Austin,
Michael A. Heroux,
James M. Willenbring
Abstract:
The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries within an object-oriented framework. It is intended for large-scale, complex multiphysics engineering and scientific applications. Epetra is one of its basic packages. It provides serial and parallel linear algebra capabilities. Before Trilinos version 11.0, released in 2012, Epetra used the C++ int data type for storing global and local indices for degrees of freedom (DOFs). Since int is typically 32 bits wide, this limited the largest problem size to fewer than approximately two billion DOFs, even if a distributed-memory machine could handle larger problems. We have added optional support for the C++ long long data type, which is at least 64 bits wide, for global indices. To save memory, maintain the speed of memory-bound operations, and reduce further changes to the code, the local indices are still 32-bit. We document the changes required to achieve this feature and how the new functionality can be used. We also report on the lessons learned in modifying a mature and popular package from various perspectives: design goals, backward compatibility, engineering decisions, C++ language features, effects on existing users and other packages, and build integration.
Submitted 25 July, 2013;
originally announced July 2013.
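The split between 64-bit global indices and 32-bit local indices described in the abstract above can be illustrated with a small sketch. It is not Epetra code and does not reproduce the Epetra API; the function names `block_ranges` and `to_global`, the contiguous block distribution, and the choice of four ranks are all assumptions made for illustration.

```python
# Illustrative only: mimics the global/local index split from the abstract.
# None of these names come from Epetra; they are hypothetical helpers.
INT32_MAX = 2**31 - 1  # largest value of a typical 32-bit C++ int


def block_ranges(num_global, num_ranks):
    """Split num_global indices into contiguous blocks, one per rank.

    Returns a list of (first_global_index, local_count) pairs.
    """
    base, extra = divmod(num_global, num_ranks)
    ranges = []
    first = 0
    for rank in range(num_ranks):
        count = base + (1 if rank < extra else 0)
        ranges.append((first, count))
        first += count
    return ranges


def to_global(first_global_index, local_index):
    """Map a rank-local index to its global index."""
    return first_global_index + local_index


num_global = 3_000_000_000  # three billion DOFs, more than INT32_MAX
ranges = block_ranges(num_global, num_ranks=4)

# Every rank owns fewer than 2**31 entries, so local indices fit in 32 bits.
assert all(count - 1 <= INT32_MAX for _, count in ranges)

# The last global index overflows 32 bits but fits easily in 64 bits.
first, count = ranges[-1]
last_global = to_global(first, count - 1)
print(last_global, last_global > INT32_MAX, last_global < 2**63)
```

In this sketch only the per-rank offset and the global index need the wider type, which is consistent with the memory argument the abstract gives for keeping local indices 32-bit.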
-
A Nonlinear Constrained Optimization Framework for Comfortable and Customizable Motion Planning of Nonholonomic Mobile Robots - Part II
Authors:
Shilpa Gulati,
Chetan Jhurani,
Benjamin Kuipers
Abstract:
In this series of papers, we present a motion planning framework for planning comfortable and customizable motion of nonholonomic mobile robots such as intelligent wheelchairs and autonomous cars. In Part I, we presented the mathematical foundation of our framework, where we model motion discomfort as a weighted cost functional and define comfortable motion planning as a nonlinear constrained optimization problem of computing trajectories that minimize this discomfort given the appropriate boundary conditions and constraints.
In this paper, we discretize the infinite-dimensional optimization problem using conforming finite elements. We describe shape functions to handle different kinds of boundary conditions and the choice of unknowns to obtain a sparse Hessian matrix. We also describe in detail how any trajectory computation problem can have infinitely many locally optimal solutions and our method of handling them. Additionally, since we have a nonlinear and constrained problem, computation of high quality initial guesses is crucial for efficient solution. We show how to compute them.
Submitted 22 May, 2013;
originally announced May 2013.
-
A Nonlinear Constrained Optimization Framework for Comfortable and Customizable Motion Planning of Nonholonomic Mobile Robots - Part I
Authors:
Shilpa Gulati,
Chetan Jhurani,
Benjamin Kuipers
Abstract:
In this series of papers, we present a motion planning framework for planning comfortable and customizable motion of nonholonomic mobile robots such as intelligent wheelchairs and autonomous cars. In this first one we present the mathematical foundation of our framework.
The motion of a mobile robot that transports a human should be comfortable and customizable. We identify several properties that a trajectory must have for comfort. We model motion discomfort as a weighted cost functional and define comfortable motion planning as a nonlinear constrained optimization problem of computing trajectories that minimize this discomfort given the appropriate boundary conditions and constraints. The optimization problem is infinite-dimensional and we discretize it using conforming finite elements. We also outline a method by which different users may customize the motion to achieve personal comfort.
Significant past work exists in kinodynamic motion planning. To the best of our knowledge, however, our work is the first comprehensive formulation of kinodynamic motion planning for a nonholonomic mobile robot as a nonlinear optimization problem that includes all of the following: a careful analysis of boundary conditions, continuity requirements on trajectory, dynamic constraints, obstacle avoidance constraints, and a robust numerical implementation.
In this paper, we present the mathematical foundation of the motion planning framework and formulate the full nonlinear constrained optimization problem. We describe, in brief, the discretization method using finite elements and the process of computing initial guesses for the optimization problem. Details of the above two are presented in Part II of the series.
Submitted 22 May, 2013;
originally announced May 2013.
-
Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs
Authors:
Chetan Jhurani
Abstract:
We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours [1] or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory. This is partly because a simple GEMM-based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in Finite Elements, but the implementation can be easily extended to other sizes. We obtain 143 and 285 GFlop/s for single precision real when processing matrices of size 10 and 16, respectively, on NVIDIA Tesla K20c using CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 126 and 268 GFlop/s, respectively. Double precision is easily supported using the C++ template mechanism.
Submitted 25 April, 2013;
originally announced April 2013.
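The matrix-free Kronecker product action mentioned in the abstract above relies on a standard identity: with column-major stacking, (A ⊗ B) vec(X) = vec(B X Aᵀ). The sketch below checks that identity on a small batch in plain Python. It is a CPU reference for the mathematics only, not the paper's GPU interface; all function names and the sample matrices are hypothetical.

```python
def matmul(P, Q):
    """Dense matrix product of nested-list matrices P and Q."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(P):
    """Transpose of a nested-list matrix."""
    return [list(row) for row in zip(*P)]


def kron(A, B):
    """Explicit Kronecker product of A and B as a dense matrix."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def vec(X):
    """Stack the columns of X into one list (column-major order)."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]


def kron_action(A, B, X):
    """Return B X A^T, the matrix form of (A kron B) applied to vec(X)."""
    return matmul(matmul(B, X), transpose(A))


def batched_kron_action(A, B, batch):
    """Apply the same component matrices A and B to every operand in batch."""
    return [kron_action(A, B, X) for X in batch]


# Hypothetical sample data: identical A and B, varying operands X.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
batch = [[[1, 0], [0, 1]], [[2, -1], [5, 3]]]

K = kron(A, B)
results = batched_kron_action(A, B, batch)

# The matrix-free result matches the explicit Kronecker matrix-vector product.
for X, Y in zip(batch, results):
    explicit = [sum(k * x for k, x in zip(row, vec(X))) for row in K]
    assert explicit == vec(Y)
```

The explicit matrix `K` has n² rows and n² columns for n-by-n components, while the matrix-free form only ever touches n-by-n matrices, which is one reason to avoid forming the Kronecker product.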
-
A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices
Authors:
Chetan Jhurani,
Paul Mullowney
Abstract:
We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar improvement in performance is obtained for other sizes, in single and double precision for real and complex types, and when the number of matrices is smaller. Apart from our implementation, our different function interface also plays an important role in the improved performance. Applications of this software include Finite Element computation on GPUs.
Submitted 25 April, 2013;
originally announced April 2013.
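As a rough illustration of what a batched GEMM computes, the sketch below applies the usual GEMM update, alpha·A·B + beta·C, independently to each matrix triple in a batch. It is a plain-Python reference for the semantics only and says nothing about the paper's interface or GPU kernels; the function names are hypothetical. The flop count uses the common 2n³ convention for an n-by-n product, which is an assumption here, since the abstract does not state how its GFlop/s figures were counted.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * A @ B + beta * C for nested-list matrices."""
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(n_inner))
             + beta * C[i][j]
             for j in range(n_cols)]
            for i in range(n_rows)]


def batched_gemm(alpha, As, Bs, beta, Cs):
    """Apply gemm independently to each (A, B, C) triple in the batch."""
    return [gemm(alpha, A, B, beta, C) for A, B, C in zip(As, Bs, Cs)]


def gemm_flops(n, batch_count):
    """Flop count for batch_count n-by-n products, assuming 2*n**3 each."""
    return 2 * n**3 * batch_count


# Hypothetical sample batch of two 2-by-2 products.
As = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
Bs = [[[1, 0], [0, 1]], [[5, 6], [7, 8]]]
Cs = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
products = batched_gemm(1, As, Bs, 0, Cs)

# Time implied by a rate of 216 GFlop/s for 100,000 products of size 16,
# valid only if that rate was measured with the 2*n**3 convention.
flops = gemm_flops(16, 100_000)
implied_ms = flops / 216e9 * 1e3
print(products, flops, round(implied_ms, 2))
```

Under that assumed convention, 100,000 products of size 16 amount to about 0.82 GFlop, so the work per batch is small and fixed overheads matter, which is consistent with the abstract's remark that the function interface also affects performance.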
-
Subspace-preserving sparsification of matrices with minimal perturbation to the near null-space. Part II: Approximation and Implementation
Authors:
Chetan Jhurani
Abstract:
This is the second of two papers to describe a matrix sparsification algorithm that takes a general real or complex matrix as input and produces a sparse output matrix of the same size. The first paper presented the original algorithm, its features, and theoretical results.
Since the output of this sparsification algorithm is a matrix rather than a vector, it can be costly in memory and run-time if an implementation does not exploit the structural properties of the algorithm and the matrix. Here we show how to modify the original algorithm to increase its efficiency. This is possible by computing an approximation to the exact result. We introduce extra constraints that are automatically determined based on the input matrix. This addition reduces the number of unknown degrees of freedom but still preserves many matrix subspaces. We also describe our open-source library that implements this sparsification algorithm and has interfaces in C++, C, and MATLAB.
Submitted 25 April, 2013;
originally announced April 2013.
-
Subspace-preserving sparsification of matrices with minimal perturbation to the near null-space. Part I: Basics
Authors:
Chetan Jhurani
Abstract:
This is the first of two papers to describe a matrix sparsification algorithm that takes a general real or complex matrix as input and produces a sparse output matrix of the same size. The non-zero entries in the output are chosen to minimize changes to the singular values and singular vectors corresponding to the near null-space of the input. The output matrix is constrained to preserve left and right null-spaces exactly. The sparsity pattern of the output matrix is automatically determined or can be given as input.
If the input matrix belongs to a common matrix subspace, we prove that the computed sparse matrix belongs to the same subspace. This works without imposing explicit constraints pertaining to the subspace. This property holds for the subspaces of Hermitian, complex-symmetric, Hamiltonian, circulant, centrosymmetric, and persymmetric matrices, and for each of the skew counterparts.
Applications of our method include computation of reusable sparse preconditioning matrices for reliable and efficient solution of high-order finite element systems. The second paper in this series describes our open-source implementation, and presents further technical details.
Submitted 25 April, 2013;
originally announced April 2013.