Showing 1–4 of 4 results for author: Matas, R

  1. arXiv:2506.05508

    cs.DC cs.AI

    Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

    Authors: Tiyasa Mitra, Ritika Borkar, Nidhi Bhatia, Ramon Matas, Shivam Raj, Dheevatsa Mudigere, Ritchie Zhao, Maximilian Golub, Arpan Dutta, Sailaja Madduri, Dharmesh Jani, Brian Pharris, Bita Darvish Rouhani

    Abstract: As inference scales to multi-node deployments, disaggregation - splitting inference into distinct phases - offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination.…

    Submitted 5 June, 2025; originally announced June 2025.

  2. arXiv:2301.10999

    cs.LG cs.PF

    PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices

    Authors: Yuji Chai, Devashree Tripathy, Chuteng Zhou, Dibakar Gope, Igor Fedorov, Ramon Matas, David Brooks, Gu-Yeon Wei, Paul Whatmough

    Abstract: The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortuna…

    Submitted 26 January, 2023; originally announced January 2023.

  3. arXiv:2201.05842

    cs.LG

    UDC: Unified DNAS for Compressible TinyML Models

    Authors: Igor Fedorov, Ramon Matas, Hokchhay Tann, Chuteng Zhou, Matthew Mattina, Paul Whatmough

    Abstract: Deploying TinyML models on low-cost IoT hardware is very challenging, due to limited device memory capacity. Neural processing unit (NPU) hardware address the memory challenge by using model compression to exploit weight quantization and sparsity to fit more parameters in the same footprint. However, designing compressible neural networks (NNs) is challenging, as it expands the design space across…

    Submitted 5 January, 2023; v1 submitted 15 January, 2022; originally announced January 2022.

  4. arXiv:2103.09404

    eess.IV cs.CV cs.LG

    Collapsible Linear Blocks for Super-Efficient Super Resolution

    Authors: Kartikeya Bhardwaj, Milos Milosavljevic, Liam O'Neil, Dibakar Gope, Ramon Matas, Alex Chalfin, Naveen Suda, Lingchuan Meng, Danny Loh

    Abstract: With the advent of smart devices that support 4K and 8K resolution, Single Image Super Resolution (SISR) has become an important computer vision problem. However, most super resolution deep networks are computationally very expensive. In this paper, we propose Super-Efficient Super Resolution (SESR) networks that establish a new state-of-the-art for efficient super resolution. Our approach is base…

    Submitted 17 March, 2022; v1 submitted 16 March, 2021; originally announced March 2021.

    Comments: Accepted at MLSys 2022 conference