-
Residual-INR: Communication Efficient On-Device Learning Using Implicit Neural Representation
Authors:
Hanqiu Chen,
Xuebin Yao,
Pradeep Subedi,
Cong Hao
Abstract:
Edge computing is a distributed computing paradigm that collects and processes data at or near the source of data generation. The on-device learning at edge relies on device-to-device wireless communication to facilitate real-time data sharing and collaborative decision-making among multiple devices. This significantly improves the adaptability of the edge computing system to the changing environm…
▽ More
Edge computing is a distributed computing paradigm that collects and processes data at or near the source of data generation. The on-device learning at edge relies on device-to-device wireless communication to facilitate real-time data sharing and collaborative decision-making among multiple devices. This significantly improves the adaptability of the edge computing system to the changing environments. However, as the scale of the edge computing system is getting larger, communication among devices is becoming the bottleneck because of the limited bandwidth of wireless communication leads to large data transfer latency. To reduce the amount of device-to-device data transmission and accelerate on-device learning, in this paper, we propose Residual-INR, a fog computing-based communication-efficient on-device learning framework by utilizing implicit neural representation (INR) to compress images/videos into neural network weights. Residual-INR enhances data transfer efficiency by collecting JPEG images from edge devices, compressing them into INR format at the fog node, and redistributing them for on-device learning. By using a smaller INR for full image encoding and a separate object INR for high-quality object region reconstruction through residual encoding, our technique can reduce the encoding redundancy while maintaining the object quality. Residual-INR is a promising solution for edge on-device learning because it reduces data transmission by up to 5.16 x across a network of 10 edge devices. It also facilitates CPU-free accelerated on-device learning, achieving up to 2.9 x speedup without sacrificing accuracy. Our code is available at: https://github.com/sharc-lab/Residual-INR.
△ Less
Submitted 16 December, 2024; v1 submitted 10 August, 2024;
originally announced August 2024.
-
ICGMM: CXL-enabled Memory Expansion with Intelligent Caching Using Gaussian Mixture Model
Authors:
Hanqiu Chen,
Yitu Wang,
Luis Vitorio Cargnini,
Mohammadreza Soltaniyeh,
Dongyang Li,
Gongjin Sun,
Pradeep Subedi,
Andrew Chang,
Yiran Chen,
Cong Hao
Abstract:
Compute Express Link (CXL) emerges as a solution for wide gap between computational speed and data communication rates among host and multiple devices. It fosters a unified and coherent memory space between host and CXL storage devices such as such as Solid-state drive (SSD) for memory expansion, with a corresponding DRAM implemented as the device cache. However, this introduces challenges such as…
▽ More
Compute Express Link (CXL) emerges as a solution for wide gap between computational speed and data communication rates among host and multiple devices. It fosters a unified and coherent memory space between host and CXL storage devices such as such as Solid-state drive (SSD) for memory expansion, with a corresponding DRAM implemented as the device cache. However, this introduces challenges such as substantial cache miss penalties, sub-optimal caching due to data access granularity mismatch between the DRAM "cache" and SSD "memory", and inefficient hardware cache management. To address these issues, we propose a novel solution, named ICGMM, which optimizes caching and eviction directly on hardware, employing a Gaussian Mixture Model (GMM)-based approach. We prototype our solution on an FPGA board, which demonstrates a noteworthy improvement compared to the classic Least Recently Used (LRU) cache strategy. We observe a decrease in the cache miss rate ranging from 0.32% to 6.14%, leading to a substantial 16.23% to 39.14% reduction in the average SSD access latency. Furthermore, when compared to the state-of-the-art Long Short-Term Memory (LSTM)-based cache policies, our GMM algorithm on FPGA showcases an impressive latency reduction of over 10,000 times. Remarkably, this is achieved while demanding much fewer hardware resources.
△ Less
Submitted 10 August, 2024;
originally announced August 2024.
-
Exploring the Role of Machine Learning in Scientific Workflows: Opportunities and Challenges
Authors:
Azita Nouri,
Philip E. Davis,
Pradeep Subedi,
Manish Parashar
Abstract:
In this survey, we discuss the challenges of executing scientific workflows as well as existing Machine Learning (ML) techniques to alleviate those challenges. We provide the context and motivation for applying ML to each step of the execution of these workflows. Furthermore, we provide recommendations on how to extend ML techniques to unresolved challenges in the execution of scientific workflows…
▽ More
In this survey, we discuss the challenges of executing scientific workflows as well as existing Machine Learning (ML) techniques to alleviate those challenges. We provide the context and motivation for applying ML to each step of the execution of these workflows. Furthermore, we provide recommendations on how to extend ML techniques to unresolved challenges in the execution of scientific workflows. Moreover, we discuss the possibility of using ML techniques for in-situ operations. We explore the challenges of in-situ workflows and provide suggestions for improving the performance of their execution using ML techniques.
△ Less
Submitted 26 October, 2021;
originally announced October 2021.
-
Scalable Graph Embedding LearningOn A Single GPU
Authors:
Azita Nouri,
Philip E. Davis,
Pradeep Subedi,
Manish Parashar
Abstract:
Graph embedding techniques have attracted growing interest since they convert the graph data into continuous and low-dimensional space. Effective graph analytic provides users a deeper understanding of what is behind the data and thus can benefit a variety of machine learning tasks. With the current scale of real-world applications, most graph analytic methods suffer high computation and space cos…
▽ More
Graph embedding techniques have attracted growing interest since they convert the graph data into continuous and low-dimensional space. Effective graph analytic provides users a deeper understanding of what is behind the data and thus can benefit a variety of machine learning tasks. With the current scale of real-world applications, most graph analytic methods suffer high computation and space costs. These methods and systems can process a network with thousands to a few million nodes. However, scaling to large-scale networks remains a challenge. The complexity of training graph embedding system requires the use of existing accelerators such as GPU. In this paper, we introduce a hybrid CPU-GPU framework that addresses the challenges of learning embedding of large-scale graphs. The performance of our method is compared qualitatively and quantitatively with the existing embedding systems on common benchmarks. We also show that our system can scale training to datasets with an order of magnitude greater than a single machine's total memory capacity. The effectiveness of the learned embedding is evaluated within multiple downstream applications. The experimental results indicate the effectiveness of the learned embedding in terms of performance and accuracy.
△ Less
Submitted 19 January, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
AI Multi-Tenancy on Edge: Concurrent Deep Learning Model Executions and Dynamic Model Placements on Edge Devices
Authors:
Piyush Subedi,
Jianwei Hao,
In Kee Kim,
Lakshmish Ramaswamy
Abstract:
Many real-world applications are widely adopting the edge computing paradigm due to its low latency and better privacy protection. With notable success in AI and deep learning (DL), edge devices and AI accelerators play a crucial role in deploying DL inference services at the edge of the Internet. While prior works quantified various edge devices' efficiency, most studies focused on the performanc…
▽ More
Many real-world applications are widely adopting the edge computing paradigm due to its low latency and better privacy protection. With notable success in AI and deep learning (DL), edge devices and AI accelerators play a crucial role in deploying DL inference services at the edge of the Internet. While prior works quantified various edge devices' efficiency, most studies focused on the performance of edge devices with single DL tasks. Therefore, there is an urgent need to investigate AI multi-tenancy on edge devices, required by many advanced DL applications for edge computing.
This work investigates two techniques - concurrent model executions and dynamic model placements - for AI multi-tenancy on edge devices. With image classification as an example scenario, we empirically evaluate AI multi-tenancy on various edge devices, AI accelerators, and DL frameworks to identify its benefits and limitations. Our results show that multi-tenancy significantly improves DL inference throughput by up to 3.3x -- 3.8x on Jetson TX2. These AI multi-tenancy techniques also open up new opportunities for flexible deployment of multiple DL services on edge devices and AI accelerators.
△ Less
Submitted 26 July, 2021;
originally announced July 2021.
-
Strategies for Democratization of Supercomputing: Availability, Accessibility and Usability of High Performance Computing for Education and Practice of Big Data Analytics
Authors:
Jim Samuel,
Margaret Brennan-Tonetta,
Yana Samuel,
Pradeep Subedi,
Jack Smith
Abstract:
There has been an increasing interest in and growing need for high performance computing (HPC), popularly known as supercomputing, in domains such as textual analytics, business domains analytics, forecasting and natural language processing (NLP), in addition to the relatively mature supercomputing domains of quantum physics and biology. HPC has been widely used in computer science (CS) and other…
▽ More
There has been an increasing interest in and growing need for high performance computing (HPC), popularly known as supercomputing, in domains such as textual analytics, business domains analytics, forecasting and natural language processing (NLP), in addition to the relatively mature supercomputing domains of quantum physics and biology. HPC has been widely used in computer science (CS) and other traditionally computation intensive disciplines, but has remained largely siloed away from the vast array of social, behavioral, business and economics disciplines. However, with ubiquitous big data, there is a compelling need to make HPC technologically and economically accessible, easy to use, and operationally democratized. Therefore, this research focuses on making two key contributions, the first is the articulation of strategies based on availability, accessibility and usability for the demystification and democratization of HPC, based on an analytical review of Caliburn, a notable supercomputer at its inception. The second contribution is a set of principles for HPC adoption based on an experiential narrative of HPC usage for textual analytics and NLP of social media data from a first time user perspective. Both, the HPC usage process and the output of the early stage analytics are summarized. This research study synthesizes expert input on HPC democratization strategies, and chronicles the challenges and opportunities from a multidisciplinary perspective, of a case of rapid adoption of supercomputing for textual analytics and NLP. Deductive logic is used to identify strategies which can lead to efficacious engagement, adoption, production and sustained usage for research, teaching, application and innovation by researchers, faculty, professionals and students across a broad range of disciplines.
△ Less
Submitted 19 April, 2021;
originally announced April 2021.
-
Exploring Trade-offs in Dynamic Task Triggering for Loosely Coupled Scientific Workflows
Authors:
Zhe Wang,
Pradeep Subedi,
Shaohua Duan,
Yubo Qin,
Philip Davis,
Anthony Simonet,
Ivan Rodero,
Manish Parashar
Abstract:
In order to achieve near-time insights, scientific workflows tend to be organized in a flexible and dynamic way. Data-driven triggering of tasks has been explored as a way to support workflows that evolve based on the data. However, the overhead introduced by such dynamic triggering of tasks is an under-studied topic. This paper discusses different facets of dynamic task triggers. Particularly, we…
▽ More
In order to achieve near-time insights, scientific workflows tend to be organized in a flexible and dynamic way. Data-driven triggering of tasks has been explored as a way to support workflows that evolve based on the data. However, the overhead introduced by such dynamic triggering of tasks is an under-studied topic. This paper discusses different facets of dynamic task triggers. Particularly, we explore different ways of constructing a data-driven dynamic workflow and then evaluate the overheads introduced by such design decisions. We evaluate workflows with varying data size, percentage of interesting data, temporal data distribution, and number of tasks triggered. Finally, we provide advice based upon analysis of the evaluation results for users looking to construct data-driven scientific workflows.
△ Less
Submitted 21 April, 2020;
originally announced April 2020.
-
Using Intel Optane Devices for In-situ Data Staging in HPC Workflows
Authors:
Pradeep Subedi,
Philip E. Davis,
J. J. Villalobos,
Ivan Rodero,
Manish Parashar
Abstract:
Emerging non-volatile memory technologies (NVRAM) offer alternatives to hard drives that are persistent, while providing similar latencies to DRAM. Intel recently released the Optane drive, which features 3D XPoint memory technology. This device can be deployed as an SSD or as persistent memory. In this paper, we provide a performance comparison between Optane (SSD DC4800X) and NVMe (SSD DC3700) d…
▽ More
Emerging non-volatile memory technologies (NVRAM) offer alternatives to hard drives that are persistent, while providing similar latencies to DRAM. Intel recently released the Optane drive, which features 3D XPoint memory technology. This device can be deployed as an SSD or as persistent memory. In this paper, we provide a performance comparison between Optane (SSD DC4800X) and NVMe (SSD DC3700) drives as block devices. We study the performance from two perspectives: 1) Benchmarking of drives using FIO workloads, and 2) Assessing the impact of using Optane over NVMe within the DataSpaces framework for in-memory data staging to support in-situ scientific workflows.
△ Less
Submitted 11 July, 2018;
originally announced July 2018.