-
SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving
Authors:
Xiangchen Li,
Dimitrios Spatharakis,
Saeid Ghafouri,
Jiakun Fan,
Dimitrios Nikolopoulos
Abstract:
Despite advancements in device capabilities, efficient inference of advanced large language models (LLMs) at the edge remains challenging due to limited device memory and power constraints. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new approach that adapts speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, to edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a method that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server efficiently batches and verifies the tokens using a more precise target model. This approach supports device heterogeneity and reduces server-side memory footprint by avoiding the need to deploy multiple target models. Our initial experiments with Jetson Orin Nano, Raspberry Pi 5, and an RTX 6000 edge server indicate substantial benefits: significantly reduced latency, improved energy efficiency, and increased concurrent inference sessions, all without sacrificing model accuracy.
Submitted 11 June, 2025;
originally announced June 2025.
-
Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
Authors:
Kamran Razavi,
Saeid Ghafouri,
Max Mühlhäuser,
Pooyan Jamshidi,
Lin Wang
Abstract:
Mobile and IoT applications increasingly adopt deep learning inference to provide intelligence. Inference requests are typically sent to a cloud infrastructure over a wireless network that is highly variable, leading to the challenge of dynamic Service Level Objectives (SLOs) at the request level. This paper presents Sponge, a novel deep learning inference serving system that maximizes resource efficiency while guaranteeing dynamic SLOs. Sponge achieves its goal by applying in-place vertical scaling, dynamic batching, and request reordering. Specifically, we introduce an Integer Programming formulation to capture the resource allocation problem, providing a mathematical model of the relationship between latency, batch size, and resources. We demonstrate the potential of Sponge through a prototype implementation and preliminary experiments, and discuss future work.
Submitted 23 April, 2024; v1 submitted 31 March, 2024;
originally announced April 2024.
-
HEET: A Heterogeneity Measure to Quantify the Difference across Distributed Computing Systems
Authors:
Ali Mokhtari,
Saeid Ghafouri,
Pooyan Jamshidi,
Mohsen Amini Salehi
Abstract:
Although system heterogeneity has been extensively studied in the past, there is yet to be a study on measuring the impact of heterogeneity on system performance. For this purpose, we propose a heterogeneity measure that can characterize the impact of the heterogeneity of a system on its performance behavior in terms of throughput or makespan. We develop a mathematical model to characterize a heterogeneous system in terms of its task and machine heterogeneity dimensions and then reduce it to a single value, called Homogeneous Equivalent Execution Time (HEET), which represents the execution time behavior of the entire system. We used AWS EC2 instances to implement a real-world machine learning inference system. Performance evaluation of the HEET score across different heterogeneous system configurations demonstrates that HEET can accurately characterize the performance behavior of these systems. In particular, the results show that our proposed method is capable of predicting the true makespan of heterogeneous systems without online evaluations with an average precision of 84%. This heterogeneity measure is instrumental for solution architects to configure their systems proactively to be sufficiently heterogeneous to meet their desired performance objectives.
Submitted 5 December, 2023;
originally announced December 2023.
-
IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency
Authors:
Saeid Ghafouri,
Kamran Razavi,
Mehran Salmani,
Alireza Sanaee,
Tania Lorido-Botran,
Lin Wang,
Joseph Doyle,
Pooyan Jamshidi
Abstract:
Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider only one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase. The code and data for replication are available at https://github.com/reconfigurable-ml-pipeline/ipa.
Submitted 26 May, 2024; v1 submitted 24 August, 2023;
originally announced August 2023.
-
Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems
Authors:
Mehran Salmani,
Saeid Ghafouri,
Alireza Sanaee,
Kamran Razavi,
Max Mühlhäuser,
Joseph Doyle,
Pooyan Jamshidi,
Mohsen Sharifi
Abstract:
The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic workloads of requests, imposing changes in their computing resources. Failing to right-size computing resources results in either violations of latency service level objectives (SLOs) or wasted computing resources. Adapting to dynamic workloads while considering all the pillars of accuracy, latency, and resource cost is challenging. In response to these challenges, we propose InfAdapter, which proactively selects a set of ML model variants with their resource allocations to meet latency SLOs while maximizing an objective function composed of accuracy and cost. InfAdapter decreases SLO violations and costs by up to 65% and 33%, respectively, compared to a popular industry autoscaler (Kubernetes Vertical Pod Autoscaler).
Submitted 24 April, 2023; v1 submitted 21 April, 2023;
originally announced April 2023.
-
Influence Maximization (IM) in Complex Networks with Limited Visibility Using Statistical Methods
Authors:
Saeid Ghafouri,
Seyed Hossein Khasteh,
Seyed Omid Azarkasb
Abstract:
A social network (SN) is a social structure consisting of a group of individuals and the interactions between them. SNs have recently been widely used and, subsequently, have become suitable and popular platforms for product promotion and information diffusion. People in an SN directly influence each other's interests and behavior. One of the most important problems in SNs is to find people who can have the maximum influence on other nodes in the network in a cascade manner if they are chosen as the seed nodes of a network diffusion scenario. Influential diffusers are people who, when chosen as the seed set of a diffusion process, lead to the largest number of people in the network learning about the diffused entity. This is known in the literature as the influence maximization (IM) problem. Although this problem has been proven to be NP-hard, with no known polynomial-time exact solution, its objective has been shown to have the properties of submodular functions and, therefore, it can be approximated using a greedy algorithm. Most of the methods proposed to improve this complexity are based on the assumption that the entire graph is visible. However, this assumption does not hold for many real-world graphs. This study extends current maximization methods with link prediction techniques to pseudo-visibility graphs. To this end, a graph generation method called the exponential random graph model (ERGM) is used for link prediction. The proposed method is tested using data from the SNAP dataset of Stanford University. According to the experimental tests, the proposed method is efficient on real-world graphs.
Submitted 11 September, 2022; v1 submitted 28 August, 2022;
originally announced August 2022.
-
Opinion Leader Detection in Online Social Networks Based on Output and Input Links
Authors:
Zahra Ghorbani,
Seyed Hossein Khasteh,
Saeid Ghafouri
Abstract:
The understanding of how users in a network update their opinions based on their neighbours' opinions has attracted a great deal of interest in the field of network science, and a growing body of literature recognises the significance of this issue. In this research paper, we propose a new dynamic model of opinion formation in directed networks. In this model, the opinion of each node is updated as the weighted average of its neighbours' opinions, where the weights represent social influence. We define a new centrality measure as a social influence metric based on both influence and conformity. We evaluate this new approach using two opinion formation models: (i) the DeGroot model and (ii) our own proposed model. Previously published research studies have not considered conformity, and have only considered the influence of the nodes when computing the social influence. In our definition, nodes with low in-degree and high out-degree that were connected to nodes with high out-degree and low in-degree had higher centrality. As the main contribution of this research, we propose an algorithm for finding a small subset of nodes in a social network that can have a significant impact on the opinions of other nodes. Experiments on real-world data demonstrate that the proposed algorithm significantly outperforms previously published state-of-the-art methods.
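For readers unfamiliar with the weighted-average update mentioned in the abstract, the following is a minimal sketch of the classic DeGroot rule only, not of the authors' proposed model or centrality measure. The function names `degroot_step` and `run_degroot`, the example weight matrix, and the initial opinions are all illustrative assumptions; `weights[i][j]` is assumed to be the influence that node `j` has on node `i`.

```python
def degroot_step(opinions, weights):
    """One synchronous DeGroot update.

    Each node replaces its opinion with the weighted average of the
    opinions of the nodes that influence it. weights[i][j] is the
    (hypothetical) influence of node j on node i.
    """
    updated = []
    for i, row in enumerate(weights):
        total = sum(row)
        if total == 0:
            # A node with no incoming influence keeps its current opinion.
            updated.append(opinions[i])
        else:
            weighted_sum = sum(w * x for w, x in zip(row, opinions))
            updated.append(weighted_sum / total)
    return updated


def run_degroot(opinions, weights, steps):
    """Apply the update `steps` times and return the final opinions."""
    for _ in range(steps):
        opinions = degroot_step(opinions, weights)
    return opinions


# Illustrative 3-node network (assumed values, not taken from the paper).
weights = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
initial = [0.0, 0.5, 1.0]

print(degroot_step(initial, weights))
print(run_degroot(initial, weights, 50))
```

In this toy network the repeated updates pull all three opinions toward a common value, which is the consensus behaviour the DeGroot model is known for when the influence graph is connected and aperiodic.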
Submitted 28 August, 2022;
originally announced August 2022.