Search | arXiv e-print repository

Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation

Authors: Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, Shiv Saini

Abstract: Retrieval-Augmented Generation (RAG) is often used with Large Language Models (LLMs) to infuse domain knowledge or user-specific information. In RAG, given a user query, a retriever extracts chunks of relevant text from a knowledge base. These chunks are sent to an LLM as part of the input prompt. Typically, any given chunk is repeatedly retrieved across user questions. However, currently, for eve… ▽ More Retrieval-Augmented Generation (RAG) is often used with Large Language Models (LLMs) to infuse domain knowledge or user-specific information. In RAG, given a user query, a retriever extracts chunks of relevant text from a knowledge base. These chunks are sent to an LLM as part of the input prompt. Typically, any given chunk is repeatedly retrieved across user questions. However, currently, for every question, attention-layers in LLMs fully compute the key values (KVs) repeatedly for the input chunks, as state-of-the-art methods cannot reuse KV-caches when chunks appear at arbitrary locations with arbitrary contexts. Naive reuse leads to output quality degradation. This leads to potentially redundant computations on expensive GPUs and increases latency. In this work, we propose Cache-Craft, a system for managing and reusing precomputed KVs corresponding to the text chunks (we call chunk-caches) in RAG-based systems. We present how to identify chunk-caches that are reusable, how to efficiently perform a small fraction of recomputation to fix the cache to maintain output quality, and how to efficiently store and evict chunk-caches in the hardware for maximizing reuse while masking any overheads. With real production workloads as well as synthetic datasets, we show that Cache-Craft reduces redundant computation by 51% over SOTA prefix-caching and 75% over full recomputation. Additionally, with continuous batching on a real production workload, we get a 1.6X speed up in throughput and a 2X reduction in end-to-end response latency over prefix-caching while maintaining quality, for both the LLaMA-3-8B and LLaMA-3-70B models. △ Less

Submitted 5 February, 2025; originally announced February 2025.

Comments: Accepted at SIGMOD 2025

arXiv:2411.18636 [pdf, other]

doi 10.20944/preprints202411.1218.v1

Towards Advanced Speech Signal Processing: A Statistical Perspective on Convolution-Based Architectures and its Applications

Authors: Nirmal Joshua Kapu, Raghav Karan

Abstract: This article surveys convolution-based models including convolutional neural networks (CNNs), Conformers, ResNets, and CRNNs-as speech signal processing models and provide their statistical backgrounds and speech recognition, speaker identification, emotion recognition, and speech enhancement applications. Through comparative training cost assessment, model size, accuracy and speed assessment, we… ▽ More This article surveys convolution-based models including convolutional neural networks (CNNs), Conformers, ResNets, and CRNNs-as speech signal processing models and provide their statistical backgrounds and speech recognition, speaker identification, emotion recognition, and speech enhancement applications. Through comparative training cost assessment, model size, accuracy and speed assessment, we compare the strengths and weaknesses of each model, identify potential errors and propose avenues for further research, emphasizing the central role it plays in advancing applications of speech technologies. △ Less

Submitted 20 November, 2024; originally announced November 2024.

arXiv:2411.00865 [pdf, other]

doi 10.1109/IITCEE64140.2025.10915349

Demo-Craft: Using In-Context Learning to Improve Code Generation in Large Language Models

Authors: Nirmal Joshua Kapu, Mihit Sreejith

Abstract: Generating executable code from natural language instructions using Large Language Models (LLMs) poses challenges such as semantic ambiguity and understanding taskspecific contexts. To address these issues, we propose a system called DemoCraft, which enhances code generation by leveraging in-context learning and demonstration selection, combined with latent concept learning. Latent concept learnin… ▽ More Generating executable code from natural language instructions using Large Language Models (LLMs) poses challenges such as semantic ambiguity and understanding taskspecific contexts. To address these issues, we propose a system called DemoCraft, which enhances code generation by leveraging in-context learning and demonstration selection, combined with latent concept learning. Latent concept learning introduces additional concept tokens, which are trainable embeddings that capture task-specific knowledge. We then test our system on two major datasets: MBPP and Humaneval. Our experimental results demonstrate that the proposed system achieves an approximate 2x increase in the pass@k metric compared to baseline models. Furthermore, we introduce two novel evaluation metrics: correctness@k and similarity@k. Our empirical studies indicate that our system attains nearly a 3x improvement in these metrics as well. △ Less

Submitted 22 March, 2025; v1 submitted 30 October, 2024; originally announced November 2024.

Comments: Accepted at IEEE ICIITCEE 2025. Presented on 16th January 2025 in Bengaluru, India

Showing 1–3 of 3 results for author: Kapu, N J