
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

Authors: Tiyasa Mitra, Ritika Borkar, Nidhi Bhatia, Ramon Matas, Shivam Raj, Dheevatsa Mudigere, Ritchie Zhao, Maximilian Golub, Arpan Dutta, Sailaja Madduri, Dharmesh Jani, Brian Pharris, Bita Darvish Rouhani

Abstract: As inference scales to multi-node deployments, disaggregation - splitting inference into distinct phases - offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for efficient disaggregated deployments to navigate the trade-off between system throughput and interactivity.

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.03812 [pdf]

Impact of friction force and retrieval speed on in silico mechanical thrombectomies: a sensitivity analysis

Authors: Mahesh S. Nagargoje, Virginia Fregona, Giulia Luraghi, Francesco Migliavacca, Demitria A Poulos, Bryan C Good, Jose Felix Rodriguez Matas

Abstract: Background: Mechanical Thrombectomy (MT) is a widely accepted first-line treatment for Acute Ischemic Stroke (AIS) and it has been studied using in vitro and in silico models. Thrombectomy outcomes have been performed for patient-specific cases using in silico models. However, until now, in vivo friction coefficients for stent-vessel, stent-clot, and clot-vessel interactions are unknown, but in vitro experiments have been attempted with significant standard deviations. These interactions and friction coefficients have been considered an important aspect of thrombectomy success. Objectives: In the current study, we explored the influence of variation in friction forces for stent-vessel, stent-clot, and clot-vessel interactions using virtual mechanical thrombectomy (VMT). We have performed three simulations for each interaction and varied friction coefficients around the standard deviation observed in the past in vitro studies. Results: (i) clot-vessel friction: higher friction leads to clot fragmentation and VMT failure. (ii) stent-clot friction: it is susceptible to VMT outcomes, with lower values showing the slippage of the clot while higher values lead to fragmentation. (iii) stent-vessel friction: higher friction shows compression of the stent in curved vessels and dislodgment of clot from stent retriever (SR) due to its compression, which leads to VMT failure. (iv) retrieval speed (RS): higher RS (>30 mm/s) leads to significant stent compression and unrealistic behavior of the SR.
Conclusions: Analysis of results proposes the necessity for calculating accurate friction factor values and their implementation into in silico models, due to their sensitivity towards thrombectomy outcomes. Such in silico models mimic in vivo thrombectomy more closely and can be used in mechanical thrombectomy planning, management, and decision-making.

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2505.03632 [pdf]

The role of friction forces in arterial mechanical thrombectomy: a review

Authors: Mahesh S. Nagargoje, Virginia Fregona, Giulia Luraghi, Francesco Migliavacca, Guglielmo Pero, Jose Felix Rodriguez Matas

Abstract: Multiple clinical trials have demonstrated the superiority of mechanical thrombectomy (MT) in treating acute ischemic stroke (AIS). Stent retriever (SR) and aspiration techniques are the standard methods for removing occluded emboli, with evolving technologies improving MT efficiency. However, procedural success remains uncertain. Frictional forces, specifically clot-vessel, clot-SR, and SR-vessel interactions, play a critical role in MT outcomes. This review examines frictional forces during MT and their impact on success, analyzing publications from 2015 to 2025. We focus on studies that calculated friction or retrieval forces using in vitro models. We have also included current trends, limitations, and future perspectives on studying and understanding frictional forces and their implementation into in silico models. Findings indicate that fibrin-rich clots are more difficult to retrieve than red blood cell (RBC)-rich clots due to their higher friction coefficient, three to four times greater, an observation supported by multiple studies. SR-vessel and SR-clot friction also influence MT effectiveness. SR-vessel interaction plays a crucial role in acutely curved vessels, as SR compression reduces its efficiency. In SR-clot interaction, RBC-rich clot fragmentation is linked to relative interaction forces. In summary, obtaining in vivo frictional values remains challenging, and inconsistencies persist in past in vitro studies.
Further, a deeper understanding of frictional forces is essential for optimizing MT, improving current SRs, and developing next-generation thrombectomy technologies.

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2301.10999 [pdf, other]

PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices

Authors: Yuji Chai, Devashree Tripathy, Chuteng Zhou, Dibakar Gope, Igor Fedorov, Ramon Matas, David Brooks, Gu-Yeon Wei, Paul Whatmough

Abstract: The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortunately, these metrics are slow to evaluate using simulators (where available) and typically require measurement on the target hardware. This work describes PerfSAGE, a novel graph neural network (GNN) that predicts inference latency, energy, and memory footprint on an arbitrary DNN TFlite graph (TFL, 2017). In contrast, previously published performance predictors can only predict latency and are restricted to pre-defined construction rules or search spaces. This paper also describes the EdgeDLPerf dataset of 134,912 DNNs randomly sampled from four task search spaces and annotated with inference performance metrics from three edge hardware platforms. Using this dataset, we train PerfSAGE and provide experimental results that demonstrate state-of-the-art prediction accuracy with a Mean Absolute Percentage Error of <5% across all targets and model search spaces.
These results: (1) Outperform previous state-of-art GNN-based predictors (Dudziak et al., 2020), (2) Accurately predict performance on accelerators (a shortfall of non-GNN-based predictors (Zhang et al., 2021)), and (3) Demonstrate predictions on arbitrary input graphs without modifications to the feature extractor.

Submitted 26 January, 2023; originally announced January 2023.

arXiv:2201.05842 [pdf, other]

UDC: Unified DNAS for Compressible TinyML Models

Authors: Igor Fedorov, Ramon Matas, Hokchhay Tann, Chuteng Zhou, Matthew Mattina, Paul Whatmough

Abstract: Deploying TinyML models on low-cost IoT hardware is very challenging, due to limited device memory capacity. Neural processing unit (NPU) hardware address the memory challenge by using model compression to exploit weight quantization and sparsity to fit more parameters in the same footprint. However, designing compressible neural networks (NNs) is challenging, as it expands the design space across which we must make balanced trade-offs. This paper demonstrates Unified DNAS for Compressible (UDC) NNs, which explores a large search space to generate state-of-the-art compressible NNs for NPU. ImageNet results show UDC networks are up to $3.35\times$ smaller (iso-accuracy) or 6.25% more accurate (iso-model size) than previous work.

Submitted 5 January, 2023; v1 submitted 15 January, 2022; originally announced January 2022.

arXiv:2103.09404 [pdf, other]

Collapsible Linear Blocks for Super-Efficient Super Resolution

Authors: Kartikeya Bhardwaj, Milos Milosavljevic, Liam O'Neil, Dibakar Gope, Ramon Matas, Alex Chalfin, Naveen Suda, Lingchuan Meng, Danny Loh

Abstract: With the advent of smart devices that support 4K and 8K resolution, Single Image Super Resolution (SISR) has become an important computer vision problem. However, most super resolution deep networks are computationally very expensive. In this paper, we propose Super-Efficient Super Resolution (SESR) networks that establish a new state-of-the-art for efficient super resolution. Our approach is based on linear overparameterization of CNNs and creates an efficient model architecture for SISR. With theoretical analysis, we uncover the limitations of existing overparameterization methods and show how the proposed method alleviates them. Detailed experiments across six benchmark datasets demonstrate that SESR achieves similar or better image quality than state-of-the-art models while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. As a result, SESR can be used on constrained hardware to perform x2 (1080p to 4K) and x4 (1080p to 8K) SISR. Towards this, we estimate hardware performance numbers for a commercial Arm mobile-Neural Processing Unit (NPU) for 1080p to 4K (x2) and 1080p to 8K (x4) SISR. Our results highlight the challenges faced by super resolution on AI accelerators and demonstrate that SESR is significantly faster (e.g., 6x-8x higher FPS) than existing models on mobile-NPU. Finally, SESR outperforms prior models by 1.5x-2x in latency on Arm CPU and GPU when deployed on a real mobile device. The code for this work is available at https://github.com/ARM-software/sesr.

Submitted 17 March, 2022; v1 submitted 16 March, 2021; originally announced March 2021.

Comments: Accepted at MLSys 2022 conference

Showing 1–6 of 6 results for author: Matas, R