Showing 1–3 of 3 results for author: Kofsky, S

  1. arXiv:2502.14617  [pdf, other]

    cs.DC

    Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale

    Authors: Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St. Amant, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan

    Abstract: Large Language Model (LLM) inference workloads handled by global cloud providers can include both latency-sensitive and insensitive tasks, creating a diverse range of Service Level Agreement (SLA) requirements. Managing these mixed workloads is challenging due to the complexity of the inference stack, which includes multiple LLMs, hardware configurations, and geographic distributions. Current opti…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: 15 pages, 17 figures, 2 tables

  2. arXiv:2411.15997  [pdf, other]

    cs.LG cs.AI cs.DC cs.MA

    Ensuring Fair LLM Serving Amid Diverse Applications

    Authors: Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, Anoop Kulkarni, Steve Kofsky, Pankhuri Choudhary, Renèe St. Amant, Rujia Wang, Yue Cheng, Ali R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan

    Abstract: In a multi-tenant large language model (LLM) serving platform hosting diverse applications, some users may submit an excessive number of requests, causing the service to become unavailable to other users and creating unfairness. Existing fairness approaches do not account for variations in token lengths across applications and multiple LLM calls, making them unsuitable for such platforms. To addre…

    Submitted 24 November, 2024; originally announced November 2024.

  3. arXiv:2408.13510  [pdf, other]

    cs.DC eess.SY

    Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing

    Authors: Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan

    Abstract: Large Language Model (LLM) workloads have distinct prefill and decode phases with different compute and memory requirements which should ideally be accounted for when scheduling input queries across different LLM instances in a cluster. However, existing scheduling algorithms treat LLM workloads as monolithic jobs without considering the distinct characteristics of the two phases in each workload.…

    Submitted 7 January, 2025; v1 submitted 24 August, 2024; originally announced August 2024.

    Comments: 16 pages, 10 figures