-
Federated Learning with Workload Reduction through Partial Training of Client Models and Entropy-Based Data Selection
Authors:
Hongrui Shi,
Valentin Radu,
Po Yang
Abstract:
With the rapid expansion of edge devices, such as IoT devices, where crucial data needed for machine learning applications is generated, it becomes essential to promote their participation in privacy-preserving Federated Learning (FL) systems. The best way to achieve this goal is to reduce their training workload to match their constrained computational resources. While prior FL research has addressed workload constraints by introducing lightweight models on the edge, limited attention has been given to optimizing on-device training efficiency by reducing the amount of data needed during training. In this work, we propose FedFT-EDS, a novel approach that combines Fine-Tuning of partial client models with Entropy-based Data Selection to reduce training workloads on edge devices. By actively selecting the most informative local instances for learning, FedFT-EDS significantly reduces the training data used in FL and demonstrates that not all user data is equally beneficial for FL in every round. Our experiments on CIFAR-10 and CIFAR-100 show that FedFT-EDS uses only 50% of user data while improving global model performance compared to the baseline methods FedAvg and FedProx. Importantly, FedFT-EDS improves client learning efficiency by up to 3 times, using one third of the training time on clients to achieve performance equivalent to the baselines. This work highlights the importance of data selection in FL and presents a promising pathway to scalable and efficient Federated Learning.
Submitted 30 December, 2024;
originally announced January 2025.
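As a rough illustration of the entropy-based data selection described in the abstract above, the sketch below ranks local samples by the Shannon entropy of the model's softmax output and keeps the top half. It assumes that "most informative" means "highest predictive entropy"; the abstract does not state the exact rule, and the function names and sample probabilities are hypothetical.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one softmax probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_by_entropy(prob_rows, keep_fraction=0.5):
    """Return sorted indices of the highest-entropy samples.

    Assumption: "most informative" is read as "highest predictive entropy";
    the abstract does not spell out the selection rule.
    """
    n_keep = max(1, math.ceil(keep_fraction * len(prob_rows)))
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: predictive_entropy(prob_rows[i]),
        reverse=True,
    )
    return sorted(ranked[:n_keep])

# Hypothetical softmax outputs of the current model on four local samples.
probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.25],  # uncertain prediction
    [0.90, 0.05, 0.05],  # fairly confident prediction
    [0.34, 0.33, 0.33],  # near-uniform prediction, highest entropy
]
selected = select_by_entropy(probs, keep_fraction=0.5)
```

With `keep_fraction=0.5` the client would train on only the selected half of its samples in that round, which mirrors the 50% data budget reported in the abstract.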
-
Closing the Gap between Client and Global Model Performance in Heterogeneous Federated Learning
Authors:
Hongrui Shi,
Valentin Radu,
Po Yang
Abstract:
The heterogeneity of hardware and data is a well-known and well-studied problem in the Federated Learning (FL) community, as FL systems run under heterogeneous settings. Recently, custom-size client models trained with Knowledge Distillation (KD) have emerged as a viable strategy for tackling the heterogeneity challenge. However, previous efforts in this direction have focused on tuning client models rather than on their impact on the knowledge aggregation of the global model. Although the performance of the global model is the primary objective of FL systems, client models have received more attention under heterogeneous settings. Here, we provide more insight into how the chosen approach for training custom client models affects the global model, which is essential for any FL application. We show that the global model can fully leverage the strength of KD with heterogeneous data. Driven by empirical observations, we further propose a new approach that combines KD and Learning without Forgetting (LwoF) to produce improved personalised models. We bring heterogeneous FL on par with the mighty FedAvg of homogeneous FL in realistic deployment scenarios with dropping clients.
Submitted 12 November, 2022; v1 submitted 7 November, 2022;
originally announced November 2022.
-
Data Selection for Efficient Model Update in Federated Learning
Authors:
Hongrui Shi,
Valentin Radu
Abstract:
The Federated Learning (FL) workflow of training a centralized model with distributed data is growing in popularity. However, until recently, this was the realm of contributing clients with similar computing capability. The fast-expanding IoT space and the data being generated and processed at the edge are encouraging more effort into expanding federated learning to include heterogeneous systems. Previous approaches distribute lightweight models to clients and rely on knowledge transfer to distil the characteristics of local data in partitioned updates. However, the additional knowledge exchange transmitted through the network degrades the communication efficiency of FL. We propose to reduce the size of the knowledge exchanged in these FL setups by clustering and selecting only the most representative bits of information from the clients. The partitioned global update adopted in our work splits the global deep neural network into a lower part for generic feature extraction and an upper part that is more sensitive to this selected client knowledge. Our experiments show that only 1.6% of the initially exchanged data can effectively transfer the characteristics of the client data to the global model in our FL approach, using split networks. These preliminary results advance our understanding of federated learning by demonstrating efficient training using strategically selected training samples.
Submitted 22 March, 2022; v1 submitted 5 November, 2021;
originally announced November 2021.
-
TMBuD: A dataset for urban scene building detection
Authors:
Orhei Ciprian,
Vert Silviu,
Mocofan Muguras,
Vasiu Radu
Abstract:
Building recognition and 3D reconstruction of human-made structures in urban scenarios have become an interesting and topical subject in the image processing domain. On this research topic, the Computer Vision and Augmented Reality areas intersect to create a better understanding of the urban scenario for various applications. In this paper we introduce a dataset, TMBuD, that is better suited to image processing on human-made structures in urban scenes. The proposed dataset allows proper evaluation of salient edge detection and semantic segmentation of images, focusing on the street-view perspective of buildings. The images that form our dataset offer various street-view perspectives of buildings in urban scenarios, which allows complex algorithms to be evaluated. The dataset features 160 images of buildings from Timisoara, Romania, with a resolution of 768 x 1024 pixels each.
Submitted 27 October, 2021;
originally announced October 2021.
-
Optimising the Performance of Convolutional Neural Networks across Computing Systems using Transfer Learning
Authors:
Rik Mulder,
Valentin Radu,
Christophe Dubach
Abstract:
The choice of convolutional routines (primitives) to implement neural networks has a tremendous impact on their inference performance (execution speed) on a given hardware platform. To optimise a neural network by primitive selection, the optimal primitive is identified for each layer of the network. This process requires a lengthy profiling stage, iterating over all the available primitives for each layer configuration, to measure their execution time on the target platform. Because each primitive exploits the hardware in different ways, new profiling is needed to obtain the best performance when moving to another platform. In this work, we propose to replace this prohibitively expensive profiling stage with a machine-learning-based approach to performance modeling. Our approach reduces the optimisation time drastically. After training, our performance model can estimate the performance of convolutional primitives in any layer configuration. The time to optimise the execution of large neural networks via primitive selection is reduced from hours to just seconds. Our performance model is easily transferable to other target platforms. We demonstrate this by training a performance model on an Intel platform and performing transfer learning to AMD and ARM processor devices with minimal profiled samples.
Submitted 20 October, 2020;
originally announced October 2020.
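The per-layer primitive selection described in the abstract above can be sketched as an argmin over predicted execution times. This is only an illustrative outline: `toy_predict_time` is a made-up stand-in for a trained performance model, and the layer configurations, primitive names, and cost numbers are all hypothetical rather than taken from the paper.

```python
def select_primitives(layers, primitives, predict_time):
    """For each layer, pick the primitive with the lowest predicted time."""
    return {
        name: min(primitives, key=lambda p: predict_time(cfg, p))
        for name, cfg in layers.items()
    }

# Hypothetical layer configurations: (kernel size, input channels, output channels).
layers = {"conv1": (3, 64, 64), "conv2": (1, 64, 256), "conv3": (5, 32, 32)}
primitives = ["direct", "im2col", "winograd"]

def toy_predict_time(cfg, primitive):
    """Stand-in for a trained performance model; the numbers are made up."""
    k, c_in, c_out = cfg
    work = k * k * c_in * c_out
    if primitive == "winograd":
        # Pretend this primitive only supports 3x3 kernels.
        return work * 0.4 if k == 3 else float("inf")
    if primitive == "im2col":
        # Cheaper per operation, plus a fixed overhead for the data reshuffle.
        return work * 0.7 + 5000.0
    return float(work)  # "direct"

plan = select_primitives(layers, primitives, toy_predict_time)
```

Because the predictor is an ordinary function call rather than a profiling run, the loop finishes quickly; retargeting to another platform would amount to swapping in a predictor adapted to that platform.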
-
TASO: Time and Space Optimization for Memory-Constrained DNN Inference
Authors:
Yuan Wen,
Andrew Anderson,
Valentin Radu,
Michael F. P. O'Boyle,
David Gregg
Abstract:
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory and energy budgets. We propose an approach for ahead-of-time domain-specific optimization of CNN models, based on integer linear programming (ILP) for selecting primitive operations to implement convolutional layers. We optimize the trade-off between execution time and memory consumption by: 1) attempting to minimize execution time across the whole network by selecting data layouts and primitive operations to implement each layer; and 2) allocating an appropriate workspace that reflects the upper bound of memory footprint per layer. These two optimization strategies can be used to run any CNN on any platform with a C compiler. Our evaluation with a range of popular ImageNet neural architectures (GoogleNet, AlexNet, VGG, ResNet and SqueezeNet) on the ARM Cortex-A15 yields speedups of 8x compared to greedy primitive selection, and reduces the memory requirement by 2.2x while sacrificing only 15% of inference time compared to a solver that considers inference time only. In addition, our optimization approach exposes a range of optimal points for different configurations across the Pareto frontier of the memory and latency trade-off, which can be used under arbitrary system constraints.
Submitted 21 May, 2020;
originally announced May 2020.
-
Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs
Authors:
Valentin Radu,
Kuba Kaszyk,
Yuan Wen,
Jack Turner,
Jose Cano,
Elliot J. Crowley,
Bjorn Franke,
Amos Storkey,
Michael O'Boyle
Abstract:
Convolutional Neural Networks (CNNs) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, often simply by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs, which are ideal for the parallel computations of neural networks and offer a lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher-level libraries, which analyze the input characteristics of a convolutional layer and, based on these, produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and the subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to a 2x slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.
Submitted 20 February, 2020;
originally announced February 2020.
-
Dynamic quantum sensing of paramagnetic species using nitrogen-vacancy centers in diamond
Authors:
Valentin Radu,
Joshua Colm Price,
Simon James Levett,
Kaarjel K. Narayanasamy,
Thomas David Bateman-Price,
Philippe Barrie Wilson,
Melissa Louise Mather
Abstract:
Naturally occurring paramagnetic species (PS), such as free radicals and paramagnetic metalloproteins, play an essential role in a multitude of critical physiological processes including metabolism, cell signaling and immune response. These highly dynamic species can also act as intrinsic biomarkers for a variety of disease states, whilst synthetic paramagnetic probes targeted to specific sites on biomolecules enable the study of functional information such as tissue oxygenation and redox status in living systems. The work presented herein describes a new sensing method that exploits the spin-dependent emission of photoluminescence (PL) from an ensemble of nitrogen-vacancy centers in diamond for rapid, non-destructive detection of PS in living systems. Uniquely, this approach involves simple measurement protocols that assess PL contrast with and without the application of microwaves. The method is demonstrated to detect concentrations of paramagnetic salts in solution and the widely used magnetic resonance imaging contrast agent Gadobutrol with a limit of detection of less than 10 attomol over a 100 micron x 100 micron field of view. Real-time monitoring of changes in the concentration of paramagnetic salts is demonstrated with image exposure times of 20 ms. Further, dynamic tracking of chemical reactions is demonstrated via the conversion of low-spin cyanide-coordinated Fe3+ to hexaaqua Fe3+ under acidic conditions. Finally, the capability to map paramagnetic species in model cells with sub-cellular resolution is demonstrated using lipid membranes containing gadolinium-labelled phospholipids under ambient conditions in the order of minutes. Overall, this work introduces a new sensing approach for the realization of fast, sensitive imaging of PS in a widefield format that is readily deployable in biomedical settings.
Submitted 21 August, 2019;
originally announced August 2019.
-
CamLoc: Pedestrian Location Detection from Pose Estimation on Resource-constrained Smart-cameras
Authors:
Adrian Cosma,
Ion Emilian Radoi,
Valentin Radu
Abstract:
Recent advancements in energy-efficient hardware technology are driving the exponential growth we are experiencing in the Internet of Things (IoT) space, with more pervasive computations being performed close to data generation sources. A range of intelligent devices and applications performing local detection is emerging (activity recognition, fitness monitoring, etc.), bringing obvious advantages such as reducing detection latency for improved interaction with devices and safeguarding user data by keeping it on the device. Video processing holds utility for many emerging applications and for data labelling in the IoT space. However, performing this video processing with deep neural networks at the edge of the Internet is not trivial. In this paper we show that pedestrian location estimation using deep neural networks is achievable on fixed cameras with limited compute resources. Our approach uses pose estimation from key body point detection to extend the pedestrian skeleton when the whole body is not in the image (occluded by obstacles or partially outside the frame), which achieves better location estimation performance (inference time and memory footprint) compared to fitting a bounding box over the pedestrian and scaling. We collect a sizable dataset comprising over 2100 frames in videos from one and two surveillance cameras pointing at the scene from different angles, and annotate each frame with the exact position of the person in the image, in 42 different scenarios of activity and occlusion. We compare our pose-estimation-based location detection with a popular detection algorithm, YOLOv2, for overlapping bounding-box generation. Our solution achieves faster inference time (15x speedup) at half the memory footprint, within the resource capabilities of embedded devices, which demonstrates that CamLoc is an efficient solution for location estimation in videos on smart-cameras.
Submitted 28 December, 2018;
originally announced December 2018.
-
Quantum sensing in a physiological-like cell niche using fluorescent nanodiamonds embedded in electrospun polymer nanofibers
Authors:
J. C. Price,
S. J. Levett,
V. Radu,
D. A. Simpson,
A. Mogas Barcons,
C. F. Adams,
M. L. Mather
Abstract:
Fluorescent nanodiamonds (fNDs) containing Nitrogen Vacancy (NV) centres are promising candidates for quantum sensing in biological environments. However, to date, little progress has been made in combining the sensing capabilities of fNDs with the biomimetic substrates used in the laboratory to support physiologically representative cell behaviour. This work describes the fabrication and implementation of electrospun Poly Lactic-co-Glycolic Acid (PLGA) nanofibers embedded with fNDs for optical quantum sensing in an environment that recapitulates the nanoscale architecture and topography of the cell niche. A range of solutions for electrospinning was prepared by mixing fNDs in different combinations of PLGA, and it was shown that fND distribution was highly dependent on PLGA and solvent concentrations. The formulation that produced uniformly dispersed fNDs was identified and subsequently electrospun into nanofibers. The resulting fND nanofibers were characterised using fluorescence microscopy and Scanning Electron Microscopy (SEM). Quantum measurements were also performed via optically detected magnetic resonance (ODMR) and longitudinal spin relaxometry. Time-varying magnetic fields external to the fND nanofibers were detected using continuous-wave ODMR to demonstrate the sensing capability of the embedded fNDs. The potential utility of fND-embedded nanofibers as biosensors in physiological environments was demonstrated by their ability to support highly viable populations of differentiated neural stem cells, a major therapeutic population able to produce electrically active neuronal circuits. The successful acquisition of ODMR spectra from the fNDs in the presence of live cells was also demonstrated on cultures of differentiating neural stem cells.
Submitted 5 December, 2018; v1 submitted 27 November, 2018;
originally announced November 2018.
-
Distilling with Performance Enhanced Students
Authors:
Jack Turner,
Elliot J. Crowley,
Valentin Radu,
José Cano,
Amos Storkey,
Michael O'Boyle
Abstract:
The task of accelerating large neural networks on general purpose hardware has, in recent years, prompted the use of channel pruning to reduce network size. However, the efficacy of pruning-based approaches has since been called into question. In this paper, we turn to distillation for model compression (specifically, attention transfer) and develop a simple method for discovering performance-enhanced student networks. We combine channel saliency metrics with empirical observations of runtime performance to design more accurate networks for a given latency budget. We apply our methodology to residual and densely-connected networks, and show that we are able to find resource-efficient student networks on different hardware platforms while maintaining very high accuracy. These performance-enhanced student networks achieve up to 10% boosts in top-1 ImageNet accuracy over their channel-pruned counterparts for the same inference time.
Submitted 7 March, 2019; v1 submitted 24 October, 2018;
originally announced October 2018.
-
Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks
Authors:
Jack Turner,
José Cano,
Valentin Radu,
Elliot J. Crowley,
Michael O'Boyle,
Amos Storkey
Abstract:
Convolutional Neural Networks (CNNs) are extremely computationally demanding, presenting a large barrier to their deployment on resource-constrained devices. Since such systems are where some of their most useful applications lie (e.g. obstacle detection for mobile robots, vision-based medical assistive technology), significant bodies of work from both machine learning and systems communities have attempted to provide optimisations that will make CNNs available to edge devices. In this paper we unify the two viewpoints in a Deep Learning Inference Stack and take an across-stack approach by implementing and evaluating the most common neural network compression techniques (weight pruning, channel pruning, and quantisation) and optimising their parallel execution with a range of programming approaches (OpenMP, OpenCL) and hardware architectures (CPU, GPU). We provide comprehensive Pareto curves to instruct trade-offs under constraints of accuracy, execution time, and memory space.
Submitted 19 September, 2018;
originally announced September 2018.
-
Quantitative results on the Ishikawa iteration of Lipschitz pseudo-contractions
Authors:
Laurentiu Leustean,
Vlad Radu,
Andrei Sipos
Abstract:
We compute uniform rates of metastability for the Ishikawa iteration of a Lipschitz pseudo-contractive self-mapping of a compact convex subset of a Hilbert space. This extraction is an instance of the proof mining program that aims to apply tools from mathematical logic in order to extract the hidden quantitative content of mathematical proofs. We prove our main result by applying methods developed by Kohlenbach, the first author and Nicolae for obtaining quantitative versions of strong convergence results for generalized Fejér monotone sequences in compact subsets of metric spaces.
Submitted 21 August, 2016;
originally announced August 2016.
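For readers unfamiliar with the objects named in the abstract above, the standard textbook definitions of the Ishikawa iteration and of metastability can be written as follows. This is a reminder of the usual formulation only; the abstract does not state the paper's specific conditions on the step sizes $\alpha_n$, $\beta_n$, so none are assumed here.

```latex
% Ishikawa iteration for a self-mapping T : C \to C of a convex set C,
% starting from x_0 \in C, with step sizes \alpha_n, \beta_n \in [0,1]:
\[
  y_n = (1-\beta_n)\,x_n + \beta_n\,T x_n, \qquad
  x_{n+1} = (1-\alpha_n)\,x_n + \alpha_n\,T y_n .
\]

% Metastability of (x_n): a finitary reformulation of the Cauchy property.
\[
  \forall k \in \mathbb{N}\ \ \forall g : \mathbb{N} \to \mathbb{N}\ \
  \exists N \in \mathbb{N}\ \ \forall i, j \in [N,\, N + g(N)] :\quad
  \| x_i - x_j \| \le \frac{1}{k+1} .
\]
```

A rate of metastability is a bound on $N$ in terms of $k$ and $g$; "uniform" indicates that the bound does not depend on the particular mapping or starting point beyond a few general parameters.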