Search | arXiv e-print repository

Twill: Scheduling Compound AI Systems on Heterogeneous Mobile Edge Platforms

Authors: Zain Taufique, Aman Vyas, Antonio Miele, Pasi Liljeberg, Anil Kanduri

Abstract: Compound AI (cAI) systems chain multiple AI models to solve complex problems. cAI systems are typically composed of deep neural networks (DNNs), transformers, and large language models (LLMs), exhibiting a high degree of computational diversity and dynamic workload variation. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer infe… ▽ More Compound AI (cAI) systems chain multiple AI models to solve complex problems. cAI systems are typically composed of deep neural networks (DNNs), transformers, and large language models (LLMs), exhibiting a high degree of computational diversity and dynamic workload variation. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer inference tasks, which arrive dynamically in an unknown sequence. Existing mobile edge AI inference strategies manage multi-DNN or transformer-only workloads, relying on design-time profiling, and cannot handle concurrent inference of DNNs and transformers required by cAI systems. In this work, we address the challenge of scheduling cAI systems on heterogeneous mobile edge platforms. We present Twill, a run-time framework to handle concurrent inference requests of cAI workloads through task affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and DVFS, while minimizing inference latency within power budgets. We implement and deploy our Twill framework on the Nvidia Jetson Orin NX platform. We evaluate Twill against state-of-the-art edge AI inference techniques over contemporary DNNs and LLMs, reducing inference latency by 54% on average, while honoring power budgets. △ Less

Submitted 1 July, 2025; originally announced July 2025.

Comments: 9 Pages, 9 Figures, Accepted in International Conference on Computer-Aided Design (ICCAD) 2025

arXiv:2411.16086 [pdf, other]

HiDP: Hierarchical DNN Partitioning for Distributed Inference on Heterogeneous Edge Platforms

Authors: Zain Taufique, Aman Vyas, Antonio Miele, Pasi Liljeberg, Anil Kanduri

Abstract: Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low latency inference, without considering the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks also do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical… ▽ More Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low latency inference, without considering the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks also do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical DNN partitioning strategy (HiDP) for distributed inference on heterogeneous edge nodes. Our strategy hierarchically partitions DNN workloads at both global and local levels by considering the core-level heterogeneity of edge nodes. We evaluated our proposed HiDP strategy against relevant distributed inference techniques over widely used DNN models on commercial edge devices. On average our strategy achieved 38% lower latency, 46% lower energy, and 56% higher throughput in comparison with other relevant approaches. △ Less

Submitted 24 November, 2024; originally announced November 2024.

Comments: 7 pages, 8 figures, 1 table, and 1 algorithm. The manuscript is accepted to be published in 28th Design, Automation and Test in Europe Conference (IEEE DATE, 2025)

arXiv:2402.09867 [pdf, other]

Characterizing Accuracy Trade-offs of EEG Applications on Embedded HMPs

Authors: Zain Taufique, Muhammad Awais Bin Altaf, Antonio Miele, Pasi Liljeberg, Anil Kanduri

Abstract: Electroencephalography (EEG) recordings are analyzed using battery-powered wearable devices to monitor brain activities and neurological disorders. These applications require long and continuous processing to generate feasible results. However, wearable devices are constrained with limited energy and computation resources, owing to their small sizes for practical use cases. Embedded heterogeneous… ▽ More Electroencephalography (EEG) recordings are analyzed using battery-powered wearable devices to monitor brain activities and neurological disorders. These applications require long and continuous processing to generate feasible results. However, wearable devices are constrained with limited energy and computation resources, owing to their small sizes for practical use cases. Embedded heterogeneous multi-core platforms (HMPs) can provide better performance within limited energy budgets for EEG applications. Error resilience of the EEG application pipeline can be exploited further to maximize the performance and energy gains with HMPs. However, disciplined tuning of approximation on embedded HMPs requires a thorough exploration of the accuracy-performance-power trade-off space. In this work, we characterize the error resilience of three EEG applications, including Epileptic Seizure Detection, Sleep Stage Classification, and Stress Detection on the real-world embedded HMP test-bed of the Odroid XU3 platform. We present a combinatorial evaluation of power-performance-accuracy trade-offs of EEG applications at different approximation, power, and performance levels to provide insights into the disciplined tuning of approximation in EEG applications on embedded platforms. △ Less

Submitted 15 February, 2024; originally announced February 2024.

Comments: 7 pages, 10 figures

arXiv:2310.10157 [pdf, other]

doi 10.1109/ASP-DAC58780.2024.10473987

Adaptive Workload Distribution for Accuracy-aware DNN Inference on Collaborative Edge Platforms

Authors: Zain Taufique, Antonio Miele, Pasi Liljeberg, Anil Kanduri

Abstract: DNN inference can be accelerated by distributing the workload among a cluster of collaborative edge nodes. Heterogeneity among edge devices and accuracy-performance trade-offs of DNN models present a complex exploration space while catering to the inference performance requirements. In this work, we propose adaptive workload distribution for DNN inference, jointly considering node-level heterogene… ▽ More DNN inference can be accelerated by distributing the workload among a cluster of collaborative edge nodes. Heterogeneity among edge devices and accuracy-performance trade-offs of DNN models present a complex exploration space while catering to the inference performance requirements. In this work, we propose adaptive workload distribution for DNN inference, jointly considering node-level heterogeneity of edge devices, and application-specific accuracy and performance requirements. Our proposed approach combinatorially optimizes heterogeneity-aware workload partitioning and dynamic accuracy configuration of DNN models to ensure performance and accuracy guarantees. We tested our approach on an edge cluster of Odroid XU4, Raspberry Pi4, and Jetson Nano boards and achieved an average gain of 41.52% in performance and 5.2% in output accuracy as compared to state-of-the-art workload distribution strategies. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 6 pages, 9 figures, 1 table, 1 algorithm, to be published in 29th Asia and South Pacific Design Automation Conference (IEEE ASP-DAC 2024)

Showing 1–4 of 4 results for author: Taufique, Z