-
Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training
Authors:
Dingqing Yang,
Amin Ghasemazar,
Xiaowei Ren,
Maximilian Golub,
Guy Lemieux,
Mieszko Lis
Abstract:
The success of DNN pruning has led to the development of energy-efficient inference accelerators that support pruned models with sparse weight and activation tensors. Because the memory layouts and dataflows in these architectures are optimized for the access patterns during $\mathit{inference}$, however, they do not efficiently support the emerging sparse $\mathit{training}$ techniques.
In this paper, we demonstrate (a) that accelerating sparse training requires a co-design approach where algorithms are adapted to suit the constraints of hardware, and (b) that hardware for sparse DNN training must tackle constraints that do not arise in inference accelerators. As proof of concept, we adapt a sparse training algorithm to be amenable to hardware acceleration; we then develop dataflow, data layout, and load-balancing techniques to accelerate it.
The resulting system is a sparse DNN training accelerator that produces pruned models with the same accuracy as dense models without first training, then pruning, and finally retraining, a dense model. Compared to training the equivalent unpruned models using a state-of-the-art DNN accelerator without sparse training support, Procrustes consumes up to 3.26$\times$ less energy and offers up to 4$\times$ speedup across a range of models, while pruning weights by an order of magnitude and maintaining unpruned accuracy.
Submitted 23 September, 2020;
originally announced September 2020.
-
TinBiNN: Tiny Binarized Neural Network Overlay in about 5,000 4-LUTs and 5mW
Authors:
Guy G. F. Lemieux,
Joe Edwards,
Joel Vandergriendt,
Aaron Severance,
Ryan De Iaco,
Abdullah Raouf,
Hussein Osman,
Tom Watzka,
Satwant Singh
Abstract:
Reduced-precision arithmetic improves the size, cost, power and performance of neural networks in digital logic. In convolutional neural networks, the use of 1b weights can achieve state-of-the-art error rates while eliminating multiplication, reducing storage and improving power efficiency. The BinaryConnect binary-weighted system, for example, achieves 9.9% error using floating-point activations on the CIFAR-10 dataset. In this paper, we introduce TinBiNN, a lightweight vector processor overlay for accelerating inference computations with 1b weights and 8b activations. The overlay is very small -- it uses about 5,000 4-input LUTs and fits into a low-cost iCE40 UltraPlus FPGA from Lattice Semiconductor. To show this can be useful, we build two embedded 'person detector' systems by shrinking the original BinaryConnect network. The first is a 10-category classifier with an 89% smaller network that runs in 1,315ms and achieves 13.6% error. The other is a 1-category classifier that is even smaller, runs in 195ms, and has only 0.4% error. In both classifiers, the error can be attributed entirely to training and not reduced precision.
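As an illustration of why 1b weights eliminate multiplication, the sketch below computes a dot product in which every weight is +1 or -1, so each term reduces to adding or subtracting the activation. The bit encoding (1 means +1, 0 means -1) and the function names are assumptions made for this example; it is not the TinBiNN overlay's implementation.

```python
def binary_dot(activations, weight_bits):
    """Dot product with +1/-1 weights, using only addition and subtraction.

    Assumed encoding for this sketch: bit 1 -> weight +1, bit 0 -> weight -1.
    """
    acc = 0
    for a, bit in zip(activations, weight_bits):
        # No multiply: the weight only selects the sign of the activation.
        acc += a if bit else -a
    return acc


def reference_dot(activations, weight_bits):
    """Same result computed with explicit multiplications, for comparison."""
    return sum(a * (1 if bit else -1) for a, bit in zip(activations, weight_bits))


acts = [12, 200, 7, 45]   # example 8b activation values (0..255)
bits = [1, 0, 1, 1]       # example 1b weights
result = binary_dot(acts, bits)   # 12 - 200 + 7 + 45 = -136
```

Each weight also occupies a single bit instead of a full word, which is where the storage reduction mentioned in the abstract comes from.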
Submitted 5 March, 2019;
originally announced March 2019.
-
Full deep neural network training on a pruned weight budget
Authors:
Maximilian Golub,
Guy Lemieux,
Mieszko Lis
Abstract:
We introduce a DNN training technique that learns only a fraction of the full parameter set without incurring an accuracy penalty. To do this, our algorithm constrains the total number of weights updated during backpropagation to those with the highest total gradients. The remaining weights are not tracked, and their initial value is regenerated at every access to avoid storing them in memory. This can dramatically reduce the number of off-chip memory accesses during both training and inference, a key component of the energy needs of DNN accelerators. By ensuring that the total weight diffusion remains close to that of baseline unpruned SGD, networks pruned using our technique are able to retain state-of-the-art accuracy across network architectures -- including networks previously identified as difficult to compress, such as Densenet and WRN. With ResNet18 on ImageNet, we observe an 11.7$\times$ weight reduction with no accuracy loss, and up to 24.4$\times$ with a small accuracy impact.
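The idea in this abstract can be sketched roughly as follows: only a fixed budget of weights is stored, chosen by accumulated gradient, and every other weight is recomputed from a seeded generator whenever it is read. This is a minimal sketch under stated assumptions, not the authors' algorithm: the class and function names, the use of accumulated gradient magnitude as the ranking score, the uniform initializer, and the learning rate are all hypothetical choices made for illustration.

```python
import random


def initial_weight(index, seed=0):
    """Recompute a weight's initial value from its index and a seed.

    Nothing is stored: the same (index, seed) always yields the same value.
    The uniform(-0.1, 0.1) initializer is an assumption for this sketch.
    """
    return random.Random(seed * 1_000_003 + index).uniform(-0.1, 0.1)


class BudgetedWeights:
    """Keeps learned updates for at most `budget` weights (hypothetical helper)."""

    def __init__(self, n, budget, seed=0):
        self.n, self.budget, self.seed = n, budget, seed
        self.tracked = {}  # index -> (learned delta, accumulated |gradient|)

    def read(self, i):
        # Untracked weights fall back to their regenerated initial value.
        delta = self.tracked.get(i, (0.0, 0.0))[0]
        return initial_weight(i, self.seed) + delta

    def update(self, grads, lr=0.1):
        # For clarity this sketch builds a candidate entry for every weight;
        # a memory-constrained version would avoid materializing them all.
        cand = dict(self.tracked)
        for i, g in enumerate(grads):
            delta, total = cand.get(i, (0.0, 0.0))
            cand[i] = (delta - lr * g, total + abs(g))
        # Keep only the `budget` weights with the largest accumulated gradient.
        keep = sorted(cand, key=lambda i: cand[i][1], reverse=True)[: self.budget]
        self.tracked = {i: cand[i] for i in keep}


w = BudgetedWeights(n=4, budget=2, seed=1)
w.update([0.5, -0.1, 0.0, 2.0])
tracked_indices = set(w.tracked)   # weights 0 and 3 have the largest gradients
```

Weights that drop out of the tracked set revert to their initial values, so only the tracked entries ever need to be stored.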
Submitted 23 November, 2019; v1 submitted 11 June, 2018;
originally announced June 2018.
-
Automated Space/Time Scaling of Streaming Task Graph
Authors:
Hossein Omidian,
Guy G. F. Lemieux
Abstract:
In this paper, we describe a high-level synthesis (HLS) tool that automatically allows area/throughput trade-offs for implementing streaming task graphs (STG). Our tool targets a massively parallel processor array (MPPA) architecture, very similar to the Ambric MPPA chip architecture, which is to be implemented as an FPGA overlay. Similar to Ambric tools, our HLS tool accepts an STG as input written in a subset of Java and a structural language in the style of a Kahn Processing Network (KPN). Unlike the Ambric tools, our HLS tool analyzes the parallelism internal to each Java "node" and evaluates the throughput and area of several possible implementations. It then analyzes the full graph for bottlenecks or excess compute capacity, selects an implementation for each node, and even considers replicating or splitting nodes while either minimizing area (for a fixed throughput target), or maximizing throughput (for a fixed area target). In addition to traditional node selection and replication methods used in prior work, we have uniquely implemented node combining and splitting to find a better area/throughput trade-off. We present two optimization approaches, a formal ILP formulation and a heuristic solution. Results show that the heuristic is more flexible and can find design points not available to the ILP, thereby achieving superior results.
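The node selection and replication step for a fixed throughput target can be illustrated with a small sketch. It is neither the paper's ILP formulation nor its heuristic: it assumes every node must sustain the same target rate, that replicating a node scales its throughput linearly, and it ignores node combining and splitting. The graph, implementation names, and area/throughput figures are made up for the example.

```python
import math


def cheapest_node_config(impls, target):
    """Pick the implementation and replica count with the least total area.

    impls: list of (name, area, throughput) options for one node.
    Returns (total_area, name, replicas) meeting the throughput target,
    assuming replicas scale throughput linearly (a simplification).
    """
    best = None
    for name, area, tput in impls:
        replicas = math.ceil(target / tput)   # fewest copies that reach the target
        total = replicas * area
        if best is None or total < best[0]:
            best = (total, name, replicas)
    return best


def plan(graph, target):
    """Choose a configuration for every node independently."""
    return {node: cheapest_node_config(impls, target) for node, impls in graph.items()}


# Hypothetical two-node graph: each node lists its candidate implementations.
graph = {
    "filter": [("small", 10, 1.0), ("unrolled", 35, 4.0)],
    "scale": [("small", 8, 2.0)],
}
result = plan(graph, target=4.0)
total_area = sum(cfg[0] for cfg in result.values())
```

Here one larger "unrolled" filter beats four replicated small ones on area, which is the kind of trade-off between selecting an implementation and replicating a node that the abstract describes.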
Submitted 12 June, 2016;
originally announced June 2016.