Showing 1–1 of 1 results for author: L, Y S B

  1. arXiv:2402.09360  [pdf, other]

    cs.LG cs.AI

    HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference

    Authors: Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

    Abstract: Autoregressive decoding with generative Large Language Models (LLMs) on accelerators (GPUs/TPUs) is often memory-bound where most of the time is spent on transferring model parameters from high bandwidth memory (HBM) to cache. On the other hand, recent works show that LLMs can maintain quality with significant sparsity/redundancy in the feedforward (FFN) layers by appropriately training the model…

    Submitted 14 February, 2024; originally announced February 2024.