-
RLibm-MultiRound: Correctly Rounded Math Libraries Without Worrying about the Application's Rounding Mode
Authors:
Sehyeok Park,
Justin Kim,
Santosh Nagarakatte
Abstract:
Our RLibm project generates a single implementation for an elementary function that produces correctly rounded results for multiple rounding modes and representations with up to 32-bits. They are appealing for developing fast reference libraries without double rounding issues. The key insight is to build polynomials that produce the correctly rounded result for a representation with two additional…
▽ More
Our RLibm project generates a single implementation for an elementary function that produces correctly rounded results for multiple rounding modes and representations with up to 32-bits. They are appealing for developing fast reference libraries without double rounding issues. The key insight is to build polynomials that produce the correctly rounded result for a representation with two additional bits when compared to the largest target representation and with the "non-standard" round-to-odd rounding mode, which makes double rounding the RLibm math library result to any smaller target representation innocuous. The resulting approximations generated by the RLibm approach are implemented with machine supported floating-point operations with the round-to-nearest rounding mode. When an application uses a rounding mode other than the round-to-nearest mode, the RLibm math library saves the application's rounding mode, changes the system's rounding mode to round-to-nearest, computes the correctly rounded result, and restores the application's rounding mode. This frequent change of rounding modes has a performance cost.
This paper proposes two new methods, which we call rounding-invariant outputs and rounding-invariant input bounds, to avoid the frequent changes to the rounding mode and the dependence on the round-to-nearest mode. First, our new rounding-invariant outputs method proposes using the round-to-zero rounding mode to implement RLibm's polynomial approximations. We propose fast, error-free transformations to emulate a round-to-zero result from any standard rounding mode without changing the rounding mode. Second, our rounding-invariant input bounds method factors any rounding error due to different rounding modes using interval bounds in the RLibm pipeline. Both methods make a different set of trade-offs and improve the performance of resulting libraries by more than 2X.
△ Less
Submitted 29 May, 2025; v1 submitted 9 April, 2025;
originally announced April 2025.
-
Report of the DOE/NSF Workshop on Correctness in Scientific Computing, June 2023, Orlando, FL
Authors:
Maya Gokhale,
Ganesh Gopalakrishnan,
Jackson Mayo,
Santosh Nagarakatte,
Cindy Rubio-González,
Stephen F. Siegel
Abstract:
This report is a digest of the DOE/NSF Workshop on Correctness in Scientific Computing (CSC'23) held on June 17, 2023, as part of the Federated Computing Research Conference (FCRC) 2023. CSC was conceived by DOE and NSF to address the growing concerns about correctness among those who employ computational methods to perform large-scale scientific simulations. These concerns have escalated, given t…
▽ More
This report is a digest of the DOE/NSF Workshop on Correctness in Scientific Computing (CSC'23) held on June 17, 2023, as part of the Federated Computing Research Conference (FCRC) 2023. CSC was conceived by DOE and NSF to address the growing concerns about correctness among those who employ computational methods to perform large-scale scientific simulations. These concerns have escalated, given the complexity, scale, and heterogeneity of today's HPC software and hardware. If correctness is not proactively addressed, there is the risk of producing flawed science on top of unacceptable productivity losses faced by computational scientists and engineers. HPC systems are beginning to include data-driven methods, including machine learning and surrogate models, and their impact on overall HPC system correctness was also felt urgent to discuss.
Stakeholders of correctness in this space were identified to belong to several sub-disciplines of computer science; from computer architecture researchers who design special-purpose hardware that offers high energy efficiencies; numerical algorithm designers who develop efficient computational schemes based on reduced precision as well as reduced data movement; all the way to researchers in programming language and formal methods who seek methodologies for correct compilation and verification. To include attendees with such a diverse set of backgrounds, CSC was held during the Federated Computing Research Conference (FCRC) 2023.
△ Less
Submitted 27 December, 2023; v1 submitted 25 December, 2023;
originally announced December 2023.
-
RLIBM-PROG: Progressive Polynomial Approximations for Fast Correctly Rounded Math Libraries
Authors:
Mridul Aanjaneya,
Jay P. Lim,
Santosh Nagarakatte
Abstract:
This paper presents a novel method for generating a single polynomial approximation that produces correctly rounded results for all inputs of an elementary function for multiple representations. The generated polynomial approximation has the nice property that the first few lower degree terms produce correctly rounded results for specific representations of smaller bitwidths, which we call progres…
▽ More
This paper presents a novel method for generating a single polynomial approximation that produces correctly rounded results for all inputs of an elementary function for multiple representations. The generated polynomial approximation has the nice property that the first few lower degree terms produce correctly rounded results for specific representations of smaller bitwidths, which we call progressive performance. To generate such progressive polynomial approximations, we approximate the correctly rounded result and formulate the computation of correctly rounded polynomial approximations as a linear program similar to our prior work on the RLibm project. To enable the use of resulting polynomial approximations in mainstream libraries, we want to avoid piecewise polynomials with large lookup tables. We observe that the problem of computing polynomial approximations for elementary functions is a linear programming problem in low dimensions, i.e., with a small number of unknowns. We design a fast randomized algorithm for computing polynomial approximations with progressive performance. Our method produces correct and fast polynomials that require a small amount of storage. A few polynomial approximations from our prototype have already been incorporated into LLVM's math library.
△ Less
Submitted 17 March, 2022; v1 submitted 24 November, 2021;
originally announced November 2021.
-
RLIBM-ALL: A Novel Polynomial Approximation Method to Produce Correctly Rounded Results for Multiple Representations and Rounding Modes
Authors:
Jay P. Lim,
Santosh Nagarakatte
Abstract:
Mainstream math libraries for floating point (FP) do not produce correctly rounded results for all inputs. In contrast, CR-LIBM and RLIBM provide correctly rounded implementations for a specific FP representation with one rounding mode. Using such libraries for a representation with a new rounding mode or with different precision will result in wrong results due to double rounding. This paper prop…
▽ More
Mainstream math libraries for floating point (FP) do not produce correctly rounded results for all inputs. In contrast, CR-LIBM and RLIBM provide correctly rounded implementations for a specific FP representation with one rounding mode. Using such libraries for a representation with a new rounding mode or with different precision will result in wrong results due to double rounding. This paper proposes a novel method to generate a single polynomial approximation that produces correctly rounded results for all inputs for multiple rounding modes and multiple precision configurations. To generate a correctly rounded library for $n$-bits, our key idea is to generate such a polynomial approximation for a representation with $n+2$-bits using the \emph{round-to-odd} mode. We prove that the resulting polynomial approximation will produce correctly rounded results for all five rounding modes in the standard and for multiple representations with $k$-bits such that $|E| +1 < k \leq n$, where $|E|$ is the number of exponent bits in the representation. Building on our prior work in the RLIBM project, we also approximate the correctly rounded result when we generate the library with $n+2$-bits using the round-to-odd mode. We also generate polynomial approximations by structuring it as a linear programming problem but propose enhancements to polynomial generation to handle the round-to-odd mode. Our prototype is the first 32-bit float library that produces correctly rounded results with all rounding modes in the IEEE standard for all inputs with a single polynomial approximation. It also produces correctly rounded results for any FP configuration ranging from 10-bits to 32-bits while also being faster than mainstream libraries.
△ Less
Submitted 29 November, 2021; v1 submitted 15 August, 2021;
originally announced August 2021.
-
SPOTS: An Accelerator for Sparse Convolutional Networks Leveraging Systolic General Matrix-Matrix Multiplication
Authors:
Mohammadreza Soltaniyeh,
Richard P. Martin,
Santosh Nagarakatte
Abstract:
This paper proposes a new hardware accelerator for sparse convolutional neural networks (CNNs) by building a hardware unit to perform the Image to Column (IM2COL) transformation of the input feature map coupled with a systolic array-based general matrix-matrix multiplication (GEMM) unit. Our design carefully overlaps the IM2COL transformation with the GEMM computation to maximize parallelism. We p…
▽ More
This paper proposes a new hardware accelerator for sparse convolutional neural networks (CNNs) by building a hardware unit to perform the Image to Column (IM2COL) transformation of the input feature map coupled with a systolic array-based general matrix-matrix multiplication (GEMM) unit. Our design carefully overlaps the IM2COL transformation with the GEMM computation to maximize parallelism. We propose a novel design for the IM2COL unit that uses a set of distributed local memories connected by a ring network, which improves energy efficiency and latency by streaming the input feature map only once. We propose a tall systolic array for the GEMM unit while also providing the ability to organize it as multiple small GEMM units, which enables our design to handle a wide range of CNNs and their parameters. Further, our design improves performance by effectively mapping the sparse data to the hardware units by utilizing sparsity in both input feature maps and weights. Our prototype, SPOTS, is on average 1.74X faster than Eyeriss. It is also 78X, and 12X more energy-efficient when compared to CPU and GPU implementations, respectively.
△ Less
Submitted 24 November, 2021; v1 submitted 28 July, 2021;
originally announced July 2021.
-
Sound, Precise, and Fast Abstract Interpretation with Tristate Numbers
Authors:
Harishankar Vishwanathan,
Matan Shachnai,
Srinivas Narayana,
Santosh Nagarakatte
Abstract:
Extended Berkeley Packet Filter (BPF) is a language and run-time system that allows non-superusers to extend the Linux and Windows operating systems by downloading user code into the kernel. To ensure that user code is safe to run in kernel context, BPF relies on a static analyzer that proves properties about the code, such as bounded memory access and the absence of operations that crash. The BPF…
▽ More
Extended Berkeley Packet Filter (BPF) is a language and run-time system that allows non-superusers to extend the Linux and Windows operating systems by downloading user code into the kernel. To ensure that user code is safe to run in kernel context, BPF relies on a static analyzer that proves properties about the code, such as bounded memory access and the absence of operations that crash. The BPF static analyzer checks safety using abstract interpretation with several abstract domains. Among these, the domain of tnums (tristate numbers) is a key domain used to reason about the bitwise uncertainty in program values. This paper formally specifies the tnum abstract domain and its arithmetic operators. We provide the first proofs of soundness and optimality of the abstract arithmetic operators for tnum addition and subtraction used in the BPF analyzer. Further, we describe a novel sound algorithm for multiplication of tnums that is more precise and efficient (runs 33% faster on average) than the Linux kernel's algorithm. Our tnum multiplication is now merged in the Linux kernel.
△ Less
Submitted 15 December, 2021; v1 submitted 11 May, 2021;
originally announced May 2021.
-
RLIBM-32: High Performance Correctly Rounded Math Libraries for 32-bit Floating Point Representations
Authors:
Jay P. Lim,
Santosh Nagarakatte
Abstract:
This paper proposes a set of techniques to develop correctly rounded math libraries for 32-bit float and posit types. It enhances our RLibm approach that frames the problem of generating correctly rounded libraries as a linear programming problem in the context of 16-bit types to scale to 32-bit types. Specifically, this paper proposes new algorithms to (1) generate polynomials that produce correc…
▽ More
This paper proposes a set of techniques to develop correctly rounded math libraries for 32-bit float and posit types. It enhances our RLibm approach that frames the problem of generating correctly rounded libraries as a linear programming problem in the context of 16-bit types to scale to 32-bit types. Specifically, this paper proposes new algorithms to (1) generate polynomials that produce correctly rounded outputs for all inputs using counterexample guided polynomial generation, (2) generate efficient piecewise polynomials with bit-pattern based domain splitting, and (3) deduce the amount of freedom available to produce correct results when range reduction involves multiple elementary functions. The resultant math library for the 32-bit float type is faster than state-of-the-art math libraries while producing the correct output for all inputs. We have also developed a set of correctly rounded elementary functions for 32-bit posits.
△ Less
Submitted 8 April, 2021;
originally announced April 2021.
-
A Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations
Authors:
Jay P. Lim,
Mridul Aanjaneya,
John Gustafson,
Santosh Nagarakatte
Abstract:
Given the importance of floating-point~(FP) performance in numerous domains, several new variants of FP and its alternatives have been proposed (e.g., Bfloat16, TensorFloat32, and Posits). These representations do not have correctly rounded math libraries. Further, the use of existing FP libraries for these new representations can produce incorrect results. This paper proposes a novel approach for…
▽ More
Given the importance of floating-point~(FP) performance in numerous domains, several new variants of FP and its alternatives have been proposed (e.g., Bfloat16, TensorFloat32, and Posits). These representations do not have correctly rounded math libraries. Further, the use of existing FP libraries for these new representations can produce incorrect results. This paper proposes a novel approach for generating polynomial approximations that can be used to implement correctly rounded math libraries. Existing methods generate polynomials that approximate the real value of an elementary function $f(x)$ and produce wrong results due to approximation errors and rounding errors in the implementation. In contrast, our approach generates polynomials that approximate the correctly rounded value of $f(x)$ (i.e., the value of $f(x)$ rounded to the target representation). It provides more margin to identify efficient polynomials that produce correctly rounded results for all inputs. We frame the problem of generating efficient polynomials that produce correctly rounded results as a linear programming problem. Our approach guarantees that we produce the correct result even with range reduction techniques. Using our approach, we have developed correctly rounded, yet faster, implementations of elementary functions for multiple target representations.
△ Less
Submitted 20 November, 2020; v1 submitted 9 July, 2020;
originally announced July 2020.
-
Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra
Authors:
Mohammadreza Soltaniyeh,
Richard P. Martin,
Santosh Nagarakatte
Abstract:
This paper describes REAP, a software-hardware approach that enables high performance sparse linear algebra computations on a cooperative CPU-FPGA platform. REAP carefully separates the task of organizing the matrix elements from the computation phase. It uses the CPU to provide a first-pass re-organization of the matrix elements, allowing the FPGA to focus on the computation. We introduce a new i…
▽ More
This paper describes REAP, a software-hardware approach that enables high performance sparse linear algebra computations on a cooperative CPU-FPGA platform. REAP carefully separates the task of organizing the matrix elements from the computation phase. It uses the CPU to provide a first-pass re-organization of the matrix elements, allowing the FPGA to focus on the computation. We introduce a new intermediate representation that allows the CPU to communicate the sparse data and the scheduling decisions to the FPGA. The computation is optimized on the FPGA for effective resource utilization with pipelining. REAP improves the performance of Sparse General Matrix Multiplication (SpGEMM) and Sparse Cholesky Factorization by 3.2X and 1.85X compared to widely used sparse libraries for them on the CPU, respectively.
△ Less
Submitted 28 April, 2020;
originally announced April 2020.
-
A Fast Causal Profiler for Task Parallel Programs
Authors:
Adarsh Yoga,
Santosh Nagarakatte
Abstract:
This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF's use of hardware performance counters to perform fine-grained measurements minimizes perturbation. TASKPROF's profile execution runs in parallel usin…
▽ More
This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF's use of hardware performance counters to perform fine-grained measurements minimizes perturbation. TASKPROF's profile execution runs in parallel using multi-cores. TASKPROF's causal profile enables users to estimate improvements in parallelism when a region of code is optimized even when concrete optimizations are not yet known. We have used TASKPROF to isolate parallelism bottlenecks in twenty three applications that use the Intel Threading Building Blocks library. We have designed parallelization techniques in five applications to in- crease parallelism by an order of magnitude using TASKPROF. Our user study indicates that developers are able to isolate performance bottlenecks with ease using TASKPROF.
△ Less
Submitted 2 July, 2017; v1 submitted 3 May, 2017;
originally announced May 2017.
-
Precondition Inference for Peephole Optimizations in LLVM
Authors:
David Menendez,
Santosh Nagarakatte
Abstract:
Peephole optimizations are a common source of compiler bugs. Compiler developers typically transform an incorrect peephole optimization into a valid one by strengthening the precondition. This process is challenging and tedious. This paper proposes ALIVE-INFER, a data-driven approach that infers preconditions for peephole optimizations expressed in Alive. ALIVE-INFER generates positive and negativ…
▽ More
Peephole optimizations are a common source of compiler bugs. Compiler developers typically transform an incorrect peephole optimization into a valid one by strengthening the precondition. This process is challenging and tedious. This paper proposes ALIVE-INFER, a data-driven approach that infers preconditions for peephole optimizations expressed in Alive. ALIVE-INFER generates positive and negative examples for an optimization, enumerates predicates on-demand, and learns a set of predicates that separate the positive and negative examples. ALIVE-INFER repeats this process until it finds a precondition that ensures the validity of the optimization. ALIVE-INFER reports both a weakest precondition and a set of succinct partial preconditions to the developer. Our prototype generates preconditions that are weaker than LLVM's preconditions for 73 optimizations in the Alive suite. We also demonstrate the applicability of this technique to generalize 54 optimization patterns generated by Souper, an LLVM~IR--based superoptimizer.
△ Less
Submitted 24 March, 2017; v1 submitted 18 November, 2016;
originally announced November 2016.