Search | arXiv e-print repository

CAOTE: KV Caching through Attention Output Error based Token Eviction

Authors: Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, Chris Lott

Abstract: While long context support of large language models has extended their abilities, it also incurs challenges in memory and compute which become crucial bottlenecks in resource-restricted devices. Token eviction, a widely adopted post-training methodology designed to alleviate the bottlenecks by evicting less important tokens from the cache, typically uses attention scores as proxy metrics for token importance. However, one major limitation of attention score as a token-wise importance metric is that it lacks the information about the contribution of tokens to the attention output. In this paper, we propose a simple eviction criterion based on the contribution of cached tokens to attention outputs. Our method, CAOTE, optimizes for eviction error due to token eviction, by seamlessly integrating attention scores and value vectors. This is the first method which uses value vector information on top of attention-based eviction scores. Additionally, CAOTE can act as a meta-heuristic method with flexible usage with any token eviction method. We show that CAOTE, when combined with the state-of-the-art attention score-based methods, always improves accuracies on the downstream task, indicating the importance of leveraging information from values during the token eviction process.

Submitted 23 April, 2025; v1 submitted 18 April, 2025; originally announced April 2025.

Comments: 14 pages, 2 figures

arXiv:2407.11306 [pdf, other]

PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer

Authors: Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli

Abstract: We present Polynomial Attention Drop-in Replacement (PADRe), a novel and unifying framework designed to replace the conventional self-attention mechanism in transformer models. Notably, several recent alternative attention mechanisms, including Hyena, Mamba, SimA, Conv2Former, and Castling-ViT, can be viewed as specific instances of our PADRe framework. PADRe leverages polynomial functions and draws upon established results from approximation theory, enhancing computational efficiency without compromising accuracy. PADRe's key components include multiplicative nonlinearities, which we implement using straightforward, hardware-friendly operations such as Hadamard products, incurring only linear computational and memory costs. PADRe further avoids the need for using complex functions such as Softmax, yet it maintains comparable or superior accuracy compared to traditional self-attention. We assess the effectiveness of PADRe as a drop-in replacement for self-attention across diverse computer vision tasks. These tasks include image classification, image-based 2D object detection, and 3D point cloud object detection. Empirical results demonstrate that PADRe runs significantly faster than the conventional self-attention (11x ~ 43x faster on server GPU and mobile NPU) while maintaining similar accuracy when substituting self-attention in the transformer models.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2311.02037 [pdf, other]

An Efficient Framework for Global Non-Convex Polynomial Optimization with Algebraic Constraints

Authors: Mitchell Tong Harris, Pierre-David Letourneau, Dalton Jones, M. Harper Langston

Abstract: We present an efficient framework for solving algebraically-constrained global non-convex polynomial optimization problems over subsets of the hypercube. We prove the existence of an equivalent nonlinear reformulation of such problems that possesses essentially no spurious local minima. Through numerical experiments on previously intractable global constrained polynomial optimization problems in high dimension, we show that polynomial scaling in dimension and degree is achievable when computing the optimal value and location.

Submitted 4 September, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

arXiv:2309.14584 [pdf, other]

A Sparse Fast Chebyshev Transform for High-Dimensional Approximation

Authors: Dalton Jones, Pierre-David Letourneau, Matthew J. Morse, M. Harper Langston

Abstract: We present the Fast Chebyshev Transform (FCT), a fast, randomized algorithm to compute a Chebyshev approximation of functions in high-dimensions from the knowledge of the location of its nonzero Chebyshev coefficients. Rather than sampling a full-resolution Chebyshev grid in each dimension, we randomly sample several grids with varied resolutions and solve a least-squares problem in coefficient space in order to compute a polynomial approximating the function of interest across all grids simultaneously. We theoretically and empirically show that the FCT exhibits quasi-linear scaling and high numerical accuracy on challenging and complex high-dimensional problems. We demonstrate the effectiveness of our approach compared to alternative Chebyshev approximation schemes. In particular, we highlight our algorithm's effectiveness in high dimensions, demonstrating significant speedups over commonly-used alternative techniques.

Submitted 2 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

MSC Class: 90C23; 41A50; 65Y20; 65D15; 93E24; 65T50; 14Q15

arXiv:2308.16731 [pdf, other]

An Efficient Framework for Global Non-Convex Polynomial Optimization over the Hypercube

Authors: Pierre-David Letourneau, Dalton Jones, Matthew Morse, M. Harper Langston

Abstract: We present a novel efficient theoretical and numerical framework for solving global non-convex polynomial optimization problems. We analytically demonstrate that such problems can be efficiently reformulated using a non-linear objective over a convex set; further, these reformulated problems possess no spurious local minima (i.e., every local minimum is a global minimum). We introduce an algorithm for solving these resulting problems using the augmented Lagrangian and the method of Burer and Monteiro. We show through numerical experiments that polynomial scaling in dimension and degree is achievable for computing the optimal value and location of previously intractable global polynomial optimization problems in high dimension.

Submitted 16 May, 2024; v1 submitted 31 August, 2023; originally announced August 2023.

arXiv:2304.04869 [pdf, other]

doi 10.1088/1538-3873/acd1b5

The James Webb Space Telescope Mission

Authors: Jonathan P. Gardner, John C. Mather, Randy Abbott, James S. Abell, Mark Abernathy, Faith E. Abney, John G. Abraham, Roberto Abraham, Yasin M. Abul-Huda, Scott Acton, Cynthia K. Adams, Evan Adams, David S. Adler, Maarten Adriaensen, Jonathan Albert Aguilar, Mansoor Ahmed, Nasif S. Ahmed, Tanjira Ahmed, Rüdeger Albat, Loïc Albert, Stacey Alberts, David Aldridge, Mary Marsha Allen, Shaune S. Allen, Martin Altenburg, et al. (983 additional authors not shown)

Abstract: Twenty-six years ago a small committee report, building on earlier studies, expounded a compelling and poetic vision for the future of astronomy, calling for an infrared-optimized space telescope with an aperture of at least $4m$. With the support of their governments in the US, Europe, and Canada, 20,000 people realized that vision as the $6.5m$ James Webb Space Telescope. A generation of astronomers will celebrate their accomplishments for the life of the mission, potentially as long as 20 years, and beyond. This report and the scientific discoveries that follow are extended thank-you notes to the 20,000 team members. The telescope is working perfectly, with much better image quality than expected. In this and accompanying papers, we give a brief history, describe the observatory, outline its objectives and current observing program, and discuss the inventions and people who made it possible. We cite detailed reports on the design and the measured performance on orbit.

Submitted 10 April, 2023; originally announced April 2023.

Comments: Accepted by PASP for the special issue on The James Webb Space Telescope Overview, 29 pages, 4 figures

arXiv:2108.11027 [pdf, other]

doi 10.1088/1748-0221/16/10/T10002

Simulations of Future Particle Accelerators: Issues and Mitigations

Authors: D. Sagan, M. Berz, N. M. Cook, Y. Hao, G. Hoffstaetter, A. Huebl, C. -K. Huang, M. H. Langston, C. E. Mayes, C. E. Mitchell, C. -K. Ng, J. Qiang, R. D. Ryne, A. Scheinker, E. Stern, J. -L. Vay, D. Winklehner, H. Zhang

Abstract: The ever increasing demands placed upon machine performance have resulted in the need for more comprehensive particle accelerator modeling. Computer simulations are key to the success of particle accelerators. Many aspects of particle accelerators rely on computer modeling at some point, sometimes requiring complex simulation tools and massively parallel supercomputing. Examples include the modeling of beams at extreme intensities and densities (toward the quantum degeneracy limit), and with ultra-fine control (down to the level of individual particles). In the future, adaptively tuned models might also be relied upon to provide beam measurements beyond the resolution of existing diagnostics. Much time and effort has been put into creating accelerator software tools, some of which are highly successful. However, there are also shortcomings such as the general inability of existing software to be easily modified to meet changing simulation needs. In this paper possible mitigating strategies are discussed for issues faced by the accelerator community as it endeavors to produce better and more comprehensive modeling tools. This includes lack of coordination between code developers, lack of standards to make codes portable and/or reusable, lack of documentation, among others.

Submitted 24 August, 2021; originally announced August 2021.

Comments: 21 pages, 1 figure. To be published in JINST

arXiv:1604.06682 [pdf, ps, other]

A sparse multidimensional FFT for real positive vectors

Authors: Pierre-David Letourneau, Harper Langston, Benoit Meister, Richard Lethin

Abstract: We present a sparse multidimensional FFT (sMFFT) randomized algorithm for real positive vectors. The algorithm works in any fixed dimension, requires O(R log(R) log(N)) samples and runs in O(R log^2(R) log(N)) complexity (where N is the total size of the vector in d dimensions and R is the number of nonzeros). It is stable to low-level noise and exhibits an exponentially small probability of failure.

Submitted 7 December, 2016; v1 submitted 22 April, 2016; originally announced April 2016.

Comments: Fixed minor typos. Corrected use of Q^{-1} in Algorithm 3 and theorem

arXiv:1512.01542 [pdf, other]

Optimizing the domain wall fermion Dirac operator using the R-Stream source-to-source compiler

Authors: Meifeng Lin, Eric Papenhausen, M. Harper Langston, Benoit Meister, Muthu Baskaran, Taku Izubuchi, Chulwoo Jung

Abstract: The application of the Dirac operator on a spinor field, the Dslash operation, is the most computation-intensive part of the lattice QCD simulations. It is often the key kernel to optimize to achieve maximum performance on various platforms. Here we report on a project to optimize the domain wall fermion Dirac operator in Columbia Physics System (CPS) using the R-Stream source-to-source compiler. Our initial target platform is the Intel PC clusters. We discuss the optimization strategies involved before and after the automatic code generation with R-Stream and present some preliminary benchmark results.

Submitted 4 December, 2015; originally announced December 2015.

Comments: 7 pages, 4 figures. Proceedings of the 33rd International Symposium on Lattice Field Theory, July 14-18, 2015, Kobe, Japan

Journal ref: PoS(LATTICE 2015)022

arXiv:1409.1914 [pdf, ps, other]

A Tale of Three Runtimes

Authors: Nicolas Vasilache, Muthu Baskaran, Tom Henretty, Benoit Meister, M. Harper Langston, Sanket Tavarageri, Richard Lethin

Abstract: This contribution discusses the automatic generation of event-driven, tuple-space based programs for task-oriented execution models from a sequential C specification. We developed a hierarchical mapping solution using auto-parallelizing compiler technology to target three different runtimes relying on event-driven tasks (EDTs). Our solution benefits from the important observation that loop types encode short, transitive relations among EDTs that are compact and efficiently evaluated at runtime. In this context, permutable loops are of particular importance as they translate immediately into conservative point-to-point synchronizations of distance 1. Our solution generates calls into a runtime-agnostic C++ layer, which we have retargeted to Intel's Concurrent Collections (CnC), ETI's SWARM, and the Open Community Runtime (OCR). Experience with other runtime systems motivates our introduction of support for hierarchical async-finishes in CnC. Experimental data is provided to show the benefit of automatically generated code for EDT-based runtimes as well as comparisons across runtimes.

Submitted 5 September, 2014; originally announced September 2014.

Showing 1–10 of 10 results for author: Langston, H