Search | arXiv e-print repository

doi 10.1016/j.dib.2024.110614

A dataset of over one thousand computed tomography scans of battery cells

Authors: Amariah Condon, Bailey Buscarino, Eric Moch, William J. Sehnert, Owen Miles, Patrick K. Herring, Peter M. Attia

Abstract: Battery technology is increasingly important for global electrification efforts. However, batteries are highly sensitive to small manufacturing variations that can induce reliability or safety issues. An important technology for battery quality control is computed tomography (CT) scanning, which is widely used for non-destructive 3D inspection across a variety of clinical and industrial applications. Historically, however, the utility of CT scanning for high-volume manufacturing has been limited by its low throughput as well as the difficulty of handling its large file sizes. In this work, we present a dataset of over one thousand CT scans of as-produced commercially available batteries. The dataset spans various chemistries (lithium-ion and sodium-ion) as well as various battery form factors (cylindrical, pouch, and prismatic). We evaluate seven different battery types in total. The manufacturing variability and the presence of battery defects can be observed via this dataset. This dataset may be of interest to scientists and engineers working on battery technology, computer vision, or both.

Submitted 7 May, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2101.01885 [pdf]

doi 10.1149/1945-7111/ac2704

Statistical learning for accurate and interpretable battery lifetime prediction

Authors: Peter M. Attia, Kristen A. Severson, Jeremy D. Witmer

Abstract: Data-driven methods for battery lifetime prediction are attracting increasing attention for applications in which the degradation mechanisms are poorly understood and suitable training sets are available. However, while advanced machine learning and deep learning methods promise high performance with minimal data preprocessing, simpler linear models with engineered features often achieve comparable performance, especially for small training sets, while also providing physical and statistical interpretability. In this work, we use a previously published dataset to develop simple, accurate, and interpretable data-driven models for battery lifetime prediction. We first present the "capacity matrix" concept as a compact representation of battery electrochemical cycling data, along with a series of feature representations. We then create a number of univariate and multivariate models, many of which achieve comparable performance to the highest-performing models previously published for this dataset. These models also provide insights into the degradation of these cells. Our approaches can be used both to quickly train models for a new dataset and to benchmark the performance of more advanced machine learning methods.

Submitted 24 April, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

Comments: Submitted to the Journal of the Electrochemical Society

arXiv:2006.04345

Fast Synthetic LiDAR Rendering via Spherical UV Unwrapping of Equirectangular Z-Buffer Images

Authors: Mohammed Hossny, Khaled Saleh, Mohammed Attia, Ahmed Abobakr, Julie Iskander

Abstract: LiDAR data is becoming increasingly essential with the rise of autonomous vehicles. Its ability to provide 360deg horizontal field of view of point cloud, equips self-driving vehicles with enhanced situational awareness capabilities. While synthetic LiDAR data generation pipelines provide a good solution to advance the machine learning research on LiDAR, they do suffer from a major shortcoming, which is rendering time. Physically accurate LiDAR simulators (e.g. Blensor) are computationally expensive with an average rendering time of 14-60 seconds per frame for urban scenes. This is often compensated for via using 3D models with simplified polygon topology (low poly assets) as is the case of CARLA (Dosovitskiy et al., 2017). However, this comes at the price of having coarse grained unrealistic LiDAR point clouds. In this paper, we present a novel method to simulate LiDAR point cloud with faster rendering time of 1 sec per frame. The proposed method relies on spherical UV unwrapping of Equirectangular Z-Buffer images. We chose Blensor (Gschwandtner et al., 2011) as the baseline method to compare the point clouds generated using the proposed method. The reported error for complex urban landscapes is 4.28cm for a scanning range between 2-120 meters with Velodyne HDL64-E2 parameters. The proposed method reported a total time per frame to 3.2 +/- 0.31 seconds per frame. In contrast, the BlenSor baseline method reported 16.2 +/- 1.82 seconds.

Submitted 8 June, 2020; originally announced June 2020.

Comments: This version has been removed by arXiv administrators as the submitter did not have the right to agree to the license at the time of submission

arXiv:2006.03048 [pdf, other]

Asymmetric Leaky Private Information Retrieval

Authors: Islam Samy, Mohamed A. Attia, Ravi Tandon, Loukas Lazos

Abstract: Information-theoretic formulations of the private information retrieval (PIR) problem have been investigated under a variety of scenarios. Symmetric private information retrieval (SPIR) is a variant where a user is able to privately retrieve one out of $K$ messages from $N$ non-colluding replicated databases without learning anything about the remaining $K-1$ messages. However, the goal of perfect privacy can be too taxing for certain applications. In this paper, we investigate if the information-theoretic capacity of SPIR (equivalently, the inverse of the minimum download cost) can be increased by relaxing both user and DB privacy definitions. Such relaxation is relevant in applications where privacy can be traded for communication efficiency. We introduce and investigate the Asymmetric Leaky PIR (AL-PIR) model with different privacy leakage budgets in each direction. For user privacy leakage, we bound the probability ratios between all possible realizations of DB queries by a function of a non-negative constant $ε$. For DB privacy, we bound the mutual information between the undesired messages, the queries, and the answers, by a function of a non-negative constant $δ$. We propose a general AL-PIR scheme that achieves an upper bound on the optimal download cost for arbitrary $ε$ and $δ$. We show that the optimal download cost of AL-PIR is upper-bounded as $D^{*}(ε,δ)\leq 1+\frac{1}{N-1}-\frac{δe^{ε}}{N^{K-1}-1}$. Second, we obtain an information-theoretic lower bound on the download cost as $D^{*}(ε,δ)\geq 1+\frac{1}{Ne^{ε}-1}-\frac{δ}{(Ne^{ε})^{K-1}-1}$. The gap analysis between the two bounds shows that our AL-PIR scheme is optimal when $ε=0$, i.e., under perfect user privacy and it is optimal within a maximum multiplicative gap of $\frac{N-e^{-ε}}{N-1}$ for any $(ε,δ)$.

Submitted 4 June, 2020; originally announced June 2020.
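
The two bounds quoted in the abstract above can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the formulas are copied from the abstract, and the only validation is the assumption $N \geq 2$, $K \geq 2$ needed to keep the denominators non-zero (the admissible range of $δ$ is not stated in the abstract and is not checked here).

```python
import math


def _check(N, K, eps, delta):
    # Assumption: N >= 2 and K >= 2 so that no denominator is zero.
    if N < 2 or K < 2:
        raise ValueError("need N >= 2 databases and K >= 2 messages")
    if eps < 0 or delta < 0:
        raise ValueError("eps and delta are non-negative constants")


def alpir_upper_bound(N, K, eps, delta):
    """Upper bound on D*(eps, delta) as quoted in the abstract."""
    _check(N, K, eps, delta)
    return 1 + 1 / (N - 1) - delta * math.exp(eps) / (N ** (K - 1) - 1)


def alpir_lower_bound(N, K, eps, delta):
    """Lower bound on D*(eps, delta) as quoted in the abstract."""
    _check(N, K, eps, delta)
    ne = N * math.exp(eps)
    return 1 + 1 / (ne - 1) - delta / (ne ** (K - 1) - 1)


def alpir_max_gap(N, eps):
    """Maximum multiplicative gap (N - e^{-eps}) / (N - 1) from the abstract."""
    return (N - math.exp(-eps)) / (N - 1)


if __name__ == "__main__":
    N, K = 3, 4  # illustrative values, not taken from the paper
    for eps, delta in [(0.0, 0.0), (0.0, 0.5), (1.0, 0.0)]:
        up = alpir_upper_bound(N, K, eps, delta)
        lo = alpir_lower_bound(N, K, eps, delta)
        print(f"eps={eps}, delta={delta}: lower={lo:.4f}, upper={up:.4f}, "
              f"ratio={up / lo:.4f}, max gap={alpir_max_gap(N, eps):.4f}")
```

Setting $ε=0$ makes the two expressions coincide term by term, which is the optimality claim in the abstract; with $δ=0$ the ratio of the two bounds equals the stated maximum gap.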

arXiv:2006.02818 [pdf, other]

Refined Continuous Control of DDPG Actors via Parametrised Activation

Authors: Mohammed Hossny, Julie Iskander, Mohammed Attia, Khaled Saleh

Abstract: In this paper, we propose enhancing actor-critic reinforcement learning agents by parameterising the final actor layer which produces the actions in order to accommodate the behaviour discrepancy of different actuators, under different load conditions during interaction with the environment. We propose branching the action producing layer in the actor to learn the tuning parameter controlling the activation layer (e.g. Tanh and Sigmoid). The learned parameters are then used to create tailored activation functions for each actuator. We ran experiments on three OpenAI Gym environments, i.e. Pendulum-v0, LunarLanderContinuous-v2 and BipedalWalker-v2. Results have shown an average of 23.15% and 33.80% increase in total episode reward of the LunarLanderContinuous-v2 and BipedalWalker-v2 environments, respectively. There was no significant improvement in Pendulum-v0 environment but the proposed method produces a more stable actuation signal compared to the state-of-the-art method. The proposed method allows the reinforcement learning actor to produce more robust actions that accommodate the discrepancy in the actuators' response functions. This is particularly useful for real life scenarios where actuators exhibit different response functions depending on the load and the interaction with the environment. This also simplifies the transfer learning problem by fine tuning the parameterised activation layers instead of retraining the entire policy every time an actuator is replaced. Finally, the proposed method would allow better accommodation to biological actuators (e.g. muscles) in biomechanical systems.

Submitted 4 June, 2020; originally announced June 2020.

Comments: 9 pages, 7 figures, 2 tables, submitted to Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/

arXiv:2001.05998 [pdf, other]

Latent-variable Private Information Retrieval

Authors: Islam Samy, Mohamed A. Attia, Ravi Tandon, Loukas Lazos

Abstract: In many applications, content accessed by users (movies, videos, news articles, etc.) can leak sensitive latent attributes, such as religious and political views, sexual orientation, ethnicity, gender, and others. To prevent such information leakage, the goal of classical PIR is to hide the identity of the content/message being accessed, which subsequently also hides the latent attributes. This solution, while private, can be too costly, particularly, when perfect (information-theoretic) privacy constraints are imposed. For instance, for a single database holding $K$ messages, privately retrieving one message is possible if and only if the user downloads the entire database of $K$ messages. Retrieving content privately, however, may not be necessary to perfectly hide the latent attributes. Motivated by the above, we formulate and study the problem of latent-variable private information retrieval (LV-PIR), which aims at allowing the user efficiently retrieve one out of $K$ messages (indexed by $θ$) without revealing any information about the latent variable (modeled by $S$). We focus on the practically relevant setting of a single database and show that one can significantly reduce the download cost of LV-PIR (compared to the classical PIR) based on the correlation between $θ$ and $S$. We present a general scheme for LV-PIR as a function of the statistical relationship between $θ$ and $S$, and also provide new results on the capacity/download cost of LV-PIR. Several open problems and new directions are also discussed.

Submitted 14 May, 2020; v1 submitted 16 January, 2020; originally announced January 2020.

arXiv:1905.08955 [pdf, other]

Domain Adaptation for Vehicle Detection from Bird's Eye View LiDAR Point Cloud Data

Authors: Khaled Saleh, Ahmed Abobakr, Mohammed Attia, Julie Iskander, Darius Nahavandi, Mohammed Hossny

Abstract: Point cloud data from 3D LiDAR sensors are one of the most crucial sensor modalities for versatile safety-critical applications such as self-driving vehicles. Since the annotations of point cloud data is an expensive and time-consuming process, therefore recently the utilisation of simulated environments and 3D LiDAR sensors for this task started to get some popularity. With simulated sensors and environments, the process for obtaining an annotated synthetic point cloud data became much easier. However, the generated synthetic point cloud data are still missing the artefacts usually exist in point cloud data from real 3D LiDAR sensors. As a result, the performance of the trained models on this data for perception tasks when tested on real point cloud data is degraded due to the domain shift between simulated and real environments. Thus, in this work, we are proposing a domain adaptation framework for bridging this gap between synthetic and real point cloud data. Our proposed framework is based on the deep cycle-consistent generative adversarial networks (CycleGAN) architecture. We have evaluated the performance of our proposed framework on the task of vehicle detection from a bird's eye view (BEV) point cloud images coming from real 3D LiDAR sensors. The framework has shown competitive results with an improvement of more than 7% in average precision score over other baseline approaches when tested on real BEV point cloud images.

Submitted 22 May, 2019; originally announced May 2019.

Comments: Under review for IEEE SMC 2019

arXiv:1904.09169 [pdf, other]

Realistic Hair Simulation Using Image Blending

Authors: Mohamed Attia, Mohammed Hossny, Saeid Nahavandi, Anousha Yazdabadi, Hamed Asadi

Abstract: In this presented work, we propose a realistic hair simulator using image blending for dermoscopic images. This hair simulator can be used for benchmarking and validation of the hair removal methods and in data augmentation for improving computer aided diagnostic tools. We adopted one of the popular implementation of image blending to superimpose realistic hair masks to hair lesion. This method was able to produce realistic hair masks according to a predefined mask for hair. Thus, the produced hair images and masks can be used as ground truth for hair segmentation and removal methods by inpainting hair according to a pre-defined hair masks on hairfree areas. Also, we achieved a realism score equals to 1.65 in comparison to 1.59 for the state-of-the-art hair simulator.

Submitted 19 April, 2019; originally announced April 2019.

arXiv:1810.06619 [pdf, other]

Diacritization of Maghrebi Arabic Sub-Dialects

Authors: Ahmed Abdelali, Mohammed Attia, Younes Samih, Kareem Darwish, Hamdy Mubarak

Abstract: Diacritization process attempt to restore the short vowels in Arabic written text; which typically are omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA) still holds the lion share, research on dialectal Arabic (DA) diacritization is very limited. In this paper, we present our contribution and results on the automatic diacritization of two sub-dialects of Maghrebi Arabic, namely Tunisian and Moroccan, using a character-level deep neural network architecture that stacks two bi-LSTM layers over a CRF output layer. The model achieves word error rate of 2.7% and 3.6% for Moroccan and Tunisian respectively and is capable of implicitly identifying the sub-dialect of the input.

Submitted 30 May, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

Comments: 6 pages, 3 figures

arXiv:1805.04104 [pdf, other]

The Capacity of Private Information Retrieval from Uncoded Storage Constrained Databases

Authors: Mohamed Adel Attia, Deepak Kumar, Ravi Tandon

Abstract: Private information retrieval (PIR) allows a user to retrieve a desired message from a set of databases without revealing the identity of the desired message. The replicated databases scenario was considered by Sun and Jafar, 2016, where $N$ databases can store the same $K$ messages completely. A PIR scheme was developed to achieve the optimal download cost given by $\left(1+ \frac{1}{N}+ \frac{1}{N^{2}}+ \cdots + \frac{1}{N^{K-1}}\right)$. In this work, we consider the problem of PIR from storage constrained databases. Each database has a storage capacity of $μKL$ bits, where $L$ is the size of each message in bits, and $μ\in [1/N, 1]$ is the normalized storage. On one extreme, $μ=1$ is the replicated databases case. On the other hand, when $μ= 1/N$, then in order to retrieve a message privately, the user has to download all the messages from the databases achieving a download cost of $1/K$. We aim to characterize the optimal download cost versus storage trade-off for any storage capacity in the range $μ\in [1/N, 1]$. For any $(N,K)$, we show that the optimal trade-off between storage, $μ$, and the download cost, $D(μ)$, is given by the lower convex hull of the $N$ pairs $\left(μ= \frac{t}{N},D(μ) = \left(1+ \frac{1}{t}+ \frac{1}{t^{2}}+ \cdots + \frac{1}{t^{K-1}}\right)\right)$ for $t=1,2,\ldots, N$. To prove this result, we first present the storage constrained PIR scheme for any $(N,K)$. We next obtain a general lower bound on the download cost for PIR, which is valid for the following storage scenarios: replicated or storage constrained, coded or uncoded, and fixed or optimized. We then specialize this bound using the uncoded storage assumption to obtain lower bounds matching the achievable download cost of the storage constrained PIR scheme for any value of the available storage.

Submitted 23 October, 2018; v1 submitted 10 May, 2018; originally announced May 2018.
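
The storage versus download trade-off stated in the abstract above (the lower convex hull of the $N$ pairs indexed by $t$) can be evaluated directly. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the corner points use only the formula quoted in the abstract, and exact rational arithmetic is an illustrative choice to avoid rounding at the corner points.

```python
from fractions import Fraction


def corner_points(N, K):
    """Pairs (mu, D) = (t/N, 1 + 1/t + ... + 1/t^(K-1)) for t = 1..N."""
    return [
        (Fraction(t, N), sum(Fraction(1, t ** i) for i in range(K)))
        for t in range(1, N + 1)
    ]


def _cross(a, b, p):
    # Signed area test: positive when a -> b -> p turns counter-clockwise.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def lower_hull(points):
    """Lower convex hull, left to right (monotone-chain construction)."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def download_cost(mu, N, K):
    """Interpolate the lower convex hull at normalized storage mu in [1/N, 1]."""
    mu = Fraction(mu)
    if not Fraction(1, N) <= mu <= 1:
        raise ValueError("mu must lie in [1/N, 1]")
    hull = lower_hull(corner_points(N, K))
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= mu <= x1:
            return y0 + (y1 - y0) * (mu - x0) / (x1 - x0)
    return hull[0][1]  # only reached when N == 1, i.e. a single corner point


if __name__ == "__main__":
    N, K = 3, 2  # illustrative values, not taken from the paper
    for mu in (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(1)):
        print(mu, download_cost(mu, N, K))
```

At $t=N$ (that is, $μ=1$) the corner point reproduces the replicated-database expression quoted earlier in the same abstract.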

arXiv:1801.01875 [pdf, other]

Near Optimal Coded Data Shuffling for Distributed Learning

Authors: Mohamed A. Attia, Ravi Tandon

Abstract: Data shuffling between distributed cluster of nodes is one of the critical steps in implementing large-scale learning algorithms. Randomly shuffling the data-set among a cluster of workers allows different nodes to obtain fresh data assignments at each learning epoch. This process has been shown to provide improvements in the learning process. However, the statistical benefits of distributed data shuffling come at the cost of extra communication overhead from the master node to worker nodes, and can act as one of the major bottlenecks in the overall time for computation. There has been significant recent interest in devising approaches to minimize this communication overhead. One approach is to provision for extra storage at the computing nodes. The other emerging approach is to leverage coded communication to minimize the overall communication overhead. The focus of this work is to understand the fundamental trade-off between the amount of storage and the communication overhead for distributed data shuffling. In this work, we first present an information theoretic formulation for the data shuffling problem, accounting for the underlying problem parameters (number of workers, $K$, number of data points, $N$, and the available storage, $S$ per node). We then present an information theoretic lower bound on the communication overhead for data shuffling as a function of these parameters. We next present a novel coded communication scheme and show that the resulting communication overhead of the proposed scheme is within a multiplicative factor of at most $\frac{K}{K-1}$ from the information-theoretic lower bound. Furthermore, we present the aligned coded shuffling scheme for some storage values, which achieves the optimal storage vs communication trade-off for $K<5$, and further reduces the maximum multiplicative gap down to $\frac{K-\frac{1}{3}}{K-1}$, for $K\geq 5$.

Submitted 5 January, 2018; originally announced January 2018.

arXiv:1711.08452 [pdf, other]

Combating Computational Heterogeneity in Large-Scale Distributed Computing via Work Exchange

Authors: Mohamed A. Attia, Ravi Tandon

Abstract: Owing to data-intensive large-scale applications, distributed computation systems have gained significant recent interest, due to their ability of running such tasks over a large number of commodity nodes in a time efficient manner. One of the major bottlenecks that adversely impacts the time efficiency is the computational heterogeneity of distributed nodes, often limiting the task completion time due to the slowest worker. In this paper, we first present a lower bound on the expected computation time based on the work-conservation principle. We then present our approach of work exchange to combat the latency problem, in which faster workers can be reassigned additional leftover computations that were originally assigned to slower workers. We present two variations of the work exchange approach: a) when the computational heterogeneity knowledge is known a priori; and b) when heterogeneity is unknown and is estimated in an online manner to assign tasks to distributed workers. As a baseline, we also present and analyze the use of an optimized Maximum Distance Separable (MDS) coded distributed computation scheme over heterogeneous nodes. Simulation results also compare the proposed approach of work exchange, the baseline MDS coded scheme and the lower bound obtained via work-conservation principle. We show that the work exchange scheme achieves time for computation which is very close to the lower bound with limited coordination and communication overhead even when the knowledge about heterogeneity levels is not available.

Submitted 22 November, 2017; originally announced November 2017.

arXiv:1708.05891 [pdf, other]

Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM

Authors: Mohamed Eldesouki, Younes Samih, Ahmed Abdelali, Mohammed Attia, Hamdy Mubarak, Kareem Darwish, Kallmeyer Laura

Abstract: Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for each dialect. The two approaches involve posing the problem as a ranking problem, where an SVM ranker picks the best segmentation, and as a sequence labeling problem, where a bi-LSTM RNN coupled with CRF determines where best to segment words. We are able to achieve solid segmentation results for all dialects using rather limited training data. We also show that employing Modern Standard Arabic data for domain adaptation and assuming context independence improve overall results.

Submitted 19 August, 2017; originally announced August 2017.

arXiv:1702.07963 [pdf, other]

Spatially Aware Melanoma Segmentation Using Hybrid Deep Learning Techniques

Authors: M. Attia, M. Hossny, S. Nahavandi, A. Yazdabadi

Abstract: In this paper, we proposed using a hybrid method that utilises deep convolutional and recurrent neural networks for accurate delineation of skin lesion of images supplied with ISBI 2017 lesion segmentation challenge. The proposed method was trained using 1800 images and tested on 150 images from ISBI 2017 challenge.

Submitted 25 February, 2017; originally announced February 2017.

Comments: ISIC2017

arXiv:1609.09823 [pdf, other]

On the Worst-case Communication Overhead for Distributed Data Shuffling

Authors: Mohamed Attia, Ravi Tandon

Abstract: Distributed learning platforms for processing large scale data-sets are becoming increasingly prevalent. In typical distributed implementations, a centralized master node breaks the data-set into smaller batches for parallel processing across distributed workers to achieve speed-up and efficiency. Several computational tasks are of sequential nature, and involve multiple passes over the data. At each iteration over the data, it is common practice to randomly re-shuffle the data at the master node, assigning different batches for each worker to process. This random re-shuffling operation comes at the cost of extra communication overhead, since at each shuffle, new data points need to be delivered to the distributed workers. In this paper, we focus on characterizing the information theoretically optimal communication overhead for the distributed data shuffling problem. We propose a novel coded data delivery scheme for the case of no excess storage, where every worker can only store the assigned data batches under processing. Our scheme exploits a new type of coding opportunity and is applicable to any arbitrary shuffle, and for any number of workers. We also present an information theoretic lower bound on the minimum communication overhead for data shuffling, and show that the proposed scheme matches this lower bound for the worst-case communication overhead.

Submitted 30 September, 2016; originally announced September 2016.

Comments: To appear in Allerton 2016

arXiv:1609.05181 [pdf, ps, other]

Information Theoretic Limits of Data Shuffling for Distributed Learning

Authors: Mohamed Attia, Ravi Tandon

Abstract: Data shuffling is one of the fundamental building blocks for distributed learning algorithms, that increases the statistical gain for each step of the learning process. In each iteration, different shuffled data points are assigned by a central node to a distributed set of workers to perform local computations, which leads to communication bottlenecks. The focus of this paper is on formalizing and understanding the fundamental information-theoretic trade-off between storage (per worker) and the worst-case communication overhead for the data shuffling problem. We completely characterize the information theoretic trade-off for $K=2$, and $K=3$ workers, for any value of storage capacity, and show that increasing the storage across workers can reduce the communication overhead by leveraging coding. We propose a novel and systematic data delivery and storage update strategy for each data shuffle iteration, which preserves the structural properties of the storage across the workers, and aids in minimizing the communication overhead in subsequent data shuffling iterations.

Submitted 16 September, 2016; originally announced September 2016.

Comments: To be presented at IEEE GLOBECOM, December 2016

Showing 1–16 of 16 results for author: Attia, M