Search | arXiv e-print repository

Improving Quantization with Post-Training Model Expansion

Authors: Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott

Abstract: The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically focused on reducing the overall volume of pre-trained models to reduce inference costs while maintaining model quality. However, recent advancements have introduced optimization techniques that, interestingly, expand models post-training, increasing model size to improve quality when reducing volume. For instance, to enable 4-bit weight and activation quantization, incoherence processing often necessitates inserting online Hadamard rotations in the compute graph, and preserving highly sensitive weights often calls for additional higher precision computations. However, if application requirements cannot be met, the prevailing solution is to relax quantization constraints. In contrast, we demonstrate post-training model expansion is a viable strategy to improve model quality within a quantization co-design space, and provide theoretical justification. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining. In particular, when quantizing the weights and activations to 4 bits for Llama3 1B, we reduce the zero-shot accuracy gap to full precision by an average of 3% relative to both QuaRot and SpinQuant with only 5% more parameters, which is still a 3.8% reduction in volume relative to a BF16 reference model.

Submitted 21 March, 2025; originally announced March 2025.
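
Note on the Hadamard rotations mentioned in the abstract above: a rotation by a normalized Hadamard matrix preserves the norm of a vector while spreading a single large outlier across all coordinates, which tends to make low-bit quantization easier. The following is a minimal pure-Python sketch of that effect only. The helper names `hadamard` and `rotate` and the toy vector are assumptions made for illustration; this is not the paper's expansion method, nor the QuaRot or SpinQuant implementation.

```python
import math


def hadamard(n):
    """Build an n x n Hadamard matrix (Sylvester construction).

    n is assumed to be a power of two.
    """
    h = [[1]]
    while len(h) < n:
        # Block step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(x):
    """Multiply x by the normalized Hadamard matrix H / sqrt(n)."""
    n = len(x)
    h = hadamard(n)
    s = 1.0 / math.sqrt(n)
    return [s * sum(h[i][j] * x[j] for j in range(n)) for i in range(n)]


# Toy activation vector with one large outlier (hypothetical values).
x = [8.0, 0.1, -0.2, 0.1]
y = rotate(x)

print("max |x| =", max(abs(v) for v in x))
print("max |y| =", max(abs(v) for v in y))
```

Because the normalized Sylvester matrix is symmetric and orthogonal, applying `rotate` twice recovers the original vector up to floating-point error, which is why such a rotation can be inserted into a compute graph and undone elsewhere.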

arXiv:2105.12078 [pdf, other]

doi 10.1162/qss_a_00255

No Deal: Investigating the Influence of Restricted Access to Elsevier Journals on German Researchers' Publishing and Citing Behaviours

Authors: Nicholas Fraser, Anne Hobert, Najko Jahn, Philipp Mayr, Isabella Peters

Abstract: In 2014, a union of German research organisations established Projekt DEAL, a national-level project to negotiate licensing agreements with large scientific publishers. Negotiations between DEAL and Elsevier began in 2016, and broke down without a successful agreement in 2018; in this time, around 200 German research institutions cancelled their license agreements with Elsevier, leading Elsevier to restrict journal access at those institutions from July 2018 onwards. We investigated the effect of these access restrictions on researchers' publishing and citing behaviours from a bibliometric perspective, using a dataset of ~410,000 articles published by researchers at the affected DEAL institutions between 2012-2020. We further investigated these effects with respect to the timing of contract cancellations with Elsevier, research disciplines, collaboration patterns, and article open-access status. We find evidence for a decrease in Elsevier's market share of articles from DEAL institutions, from a peak of 25.3% in 2015 to 20.6% in 2020, with the largest year-on-year market share decreases occurring in 2019 (-1.1%) and 2020 (-1.6%) following the implementation of access restrictions. We also observe year-on-year decreases in the proportion of citations made from articles published by authors at DEAL institutions to articles in Elsevier journals post-2018, although the decrease is smaller (-0.4% in 2019 and -0.6% in 2020) than changes in publishing volume. We conclude that Elsevier access restrictions have led to some reduced willingness of researchers at DEAL institutions to publish their research in Elsevier journals, but that researchers are not strongly affected in their ability to cite Elsevier articles, with the implication that researchers use a variety of other methods (e.g. interlibrary loans, sharing between colleagues, or "shadow libraries") to access scientific literature.

Submitted 25 May, 2021; originally announced May 2021.

Comments: 34 pages, 13 figures, preprint

Journal ref: QSS 2023

arXiv:2103.14522 [pdf]

doi 10.1007/s11192-021-03972-5

What happens when a journal converts to Open Access? A bibliometric analysis

Authors: Fakhri Momeni, Philipp Mayr, Nicholas Fraser, Isabella Peters

Abstract: In recent years, increased stakeholder pressure to transition research to Open Access has led to many journals converting, or 'flipping', from a closed access (CA) to an open access (OA) publishing model. Changing the publishing model can influence the decision of authors to submit their papers to a journal, and increased article accessibility may influence citation behaviour. In this paper we aimed to understand how flipping a journal to an OA model influences the journal's future publication volumes and citation impact. We analysed two independent sets of journals that had flipped to an OA model, one from the Directory of Open Access Journals (DOAJ) and one from the Open Access Directory (OAD), and compared their development with two respective control groups of similar journals. For bibliometric analyses, journals were matched to the Scopus database. We assessed changes in the number of articles published over time, as well as two citation metrics at the journal and article level: the normalised impact factor (IF) and the average relative citations (ARC), respectively. Our results show that overall, journals that flipped to an OA model increased their publication output compared to journals that remained closed. Mean normalised IF and ARC also generally increased following the flip to an OA model, at a greater rate than was observed in the control groups. However, the changes appear to vary largely by scientific discipline. Overall, these results indicate that flipping to an OA publishing model can bring positive changes to a journal.

Submitted 26 March, 2021; originally announced March 2021.

Comments: 16 pages, 5 figures, Accepted in Scientometrics

arXiv:2102.11289 [pdf, other]

doi 10.3389/frai.2021.676564

Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference

Authors: Benjamin Hawks, Javier Duarte, Nicholas J. Fraser, Alessandro Pappalardo, Nhan Tran, Yaman Umuroglu

Abstract: Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra low latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similar to or better in terms of computational efficiency compared to other neural architecture search techniques like Bayesian optimization. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.

Submitted 19 July, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: 22 pages, 7 Figures, 1 Table

Report number: FERMILAB-PUB-21-056-SCD

Journal ref: Front. AI 4, 94 (2021)

arXiv:2011.07317 [pdf, other]

Memory-Efficient Dataflow Inference for Deep CNNs on FPGA

Authors: Lucian Petrica, Tobias Alonso, Mairin Kroes, Nicholas Fraser, Sorin Cotofana, Michaela Blott

Abstract: Custom dataflow Convolutional Neural Network (CNN) inference accelerators on FPGA are tailored to a specific CNN topology and store parameters in On-Chip Memory (OCM), resulting in high energy efficiency and low inference latency. However, in these accelerators the shapes of parameter memories are dictated by throughput constraints and do not map well to the underlying OCM, which becomes an implementation bottleneck. In this work, we propose an accelerator design methodology - Frequency Compensated Memory Packing (FCMP) - which improves the OCM utilization efficiency of dataflow accelerators with minimal reduction in throughput and no modifications to the physical structure of FPGA OCM. To validate our methodology, we apply it to several realizations of medium-sized CIFAR-10 inference accelerators and demonstrate up to 30% reduction in OCM utilization without loss of inference throughput, allowing us to port the accelerators from Xilinx Zynq 7020 to 7012S, reducing application cost. We also implement a custom dataflow FPGA inference accelerator for a quantized ResNet-50 CNN, utilizing on-chip weights, the largest topology ever implemented with this accelerator architecture. We demonstrate that by applying FCMP to the ResNet accelerator, the OCM bottleneck is alleviated which enables the accelerator to be ported from Alveo U250 to the smaller Alveo U280 board with less throughput loss compared to alternative techniques. By providing a finer-grained trade off between throughput and OCM requirements, FCMP increases the flexibility of custom dataflow CNN inference designs on FPGA.

Submitted 14 November, 2020; originally announced November 2020.

Comments: To appear in FPT 2020 proceedings

arXiv:2011.05873 [pdf, ps, other]

FAT: Training Neural Networks for Reliable Inference Under Hardware Faults

Authors: Ussama Zahid, Giulio Gambardella, Nicholas J. Fraser, Michaela Blott, Kees Vissers

Abstract: Deep neural networks (DNNs) are state-of-the-art algorithms for multiple applications, spanning from image classification to speech recognition. While providing excellent accuracy, they often have enormous compute and memory requirements. As a result of this, quantized neural networks (QNNs) are increasingly being adopted and deployed especially on embedded devices, thanks to their high accuracy, but also since they have significantly lower compute and memory requirements compared to their floating point equivalents. QNN deployment is also being evaluated for safety-critical applications, such as automotive, avionics, medical or industrial. These systems require functional safety, guaranteeing failure-free behaviour even in the presence of hardware faults. In general fault tolerance can be achieved by adding redundancy to the system, which further exacerbates the overall computational demands and makes it difficult to meet the power and performance requirements. In order to decrease the hardware cost for achieving functional safety, it is vital to explore domain-specific solutions which can exploit the inherent features of DNNs. In this work we present a novel methodology called fault-aware training (FAT), which includes error modeling during neural network (NN) training, to make QNNs resilient to specific fault models on the device. Our experiments show that by injecting faults in the convolutional layers during training, highly accurate convolutional neural networks (CNNs) can be trained which exhibit much better error tolerance compared to the original. Furthermore, we show that redundant systems which are built from QNNs trained with FAT achieve higher worst-case accuracy at lower hardware cost. This has been validated for numerous classification tasks including CIFAR10, GTSRB, SVHN and ImageNet.

Submitted 11 November, 2020; originally announced November 2020.

arXiv:2004.03021 [pdf, other]

LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications

Authors: Yaman Umuroglu, Yash Akhauri, Nicholas J. Fraser, Michaela Blott

Abstract: Deployment of deep neural networks for applications that require very high throughput or extremely low latency is a severe computational challenge, further exacerbated by inefficiencies in mapping the computation to hardware. We present a novel method for designing neural network topologies that directly map to a highly efficient FPGA implementation. By exploiting the equivalence of artificial neurons with quantized inputs/outputs and truth tables, we can train quantized neural networks that can be directly converted to a netlist of truth tables, and subsequently deployed as a highly pipelinable, massively parallel FPGA circuit. However, the neural network topology requires careful consideration since the hardware cost of truth tables grows exponentially with neuron fan-in. To obtain smaller networks where the whole netlist can be placed-and-routed onto a single FPGA, we derive a fan-in driven hardware cost model to guide topology design, and combine high sparsity with low-bit activation quantization to limit the neuron fan-in. We evaluate our approach on two tasks with very high intrinsic throughput requirements in high-energy physics and network intrusion detection. We show that the combination of sparsity and low-bit activation quantization results in high-speed circuits with small logic depth and low LUT cost, demonstrating competitive accuracy with less than 15 ns of inference latency and throughput in the hundreds of millions of inferences per second.

Submitted 6 April, 2020; originally announced April 2020.
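
Note on the truth-table equivalence described in the abstract above: a neuron whose inputs and output are quantized can be enumerated exhaustively, and the number of table rows is 2 raised to (fan-in × bits per input), which is the exponential growth the abstract refers to. The sketch below illustrates this with a toy threshold neuron. The function names, the weights, and the 1-bit output are assumptions chosen for illustration; it is not the LogicNets tool flow or its hardware cost model.

```python
from itertools import product


def truth_table_entries(fan_in: int, input_bits: int) -> int:
    """Number of rows in the truth table of one neuron."""
    return 2 ** (fan_in * input_bits)


def neuron_to_truth_table(weights, threshold, input_bits=1):
    """Enumerate every input combination of a toy quantized neuron.

    Inputs are unsigned integers in [0, 2**input_bits); the output is a
    single bit produced by thresholding the weighted sum.
    """
    levels = range(2 ** input_bits)
    table = {}
    for inputs in product(levels, repeat=len(weights)):
        acc = sum(w * x for w, x in zip(weights, inputs))
        table[inputs] = 1 if acc >= threshold else 0
    return table


# Hypothetical 3-input neuron with binary inputs.
table = neuron_to_truth_table(weights=[1, -1, 2], threshold=1, input_bits=1)
print(len(table), "rows")

# Table size as fan-in grows, for 2-bit inputs.
for fan_in in (2, 4, 6, 8):
    print(fan_in, truth_table_entries(fan_in, input_bits=2))
```

Doubling the fan-in squares the number of rows, which suggests why limiting fan-in through sparsity and low-bit activations matters for keeping such a table small.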

arXiv:1910.11568 [pdf]

Open Access -- Towards a non-normative and systematic understanding

Authors: Niels Taubert, Anne Hobert, Nicolas Fraser, Najko Jahn, Elham Iravani

Abstract: The term Open Access not only describes a certain model of scholarly publishing -- namely in digital format freely accessible to readers -- but often also implies that free availability of research results is desirable, and hence has a normative character. Together with the large variety of presently used definitions of different Open Access types, this normativity hinders a systematic investigation of the development of open availability of scholarly literature. In this paper, we propose a non-normative definition of Open Access and its usage as a neutral, descriptive term in bibliometric studies and research on science. To this end, we first specify what normative figures are commonly associated with the term Open Access and then develop a neutral definition. We further identify distinguishing characteristics of openly accessible literature, called dimensions, and derive a classification scheme into Open Access categories based on these dimensions. Additionally, we present an operationalisation method to assign scientific publications to the respective categories in practice. Here, we describe useful data sources, which can be employed to gather the information needed for the classification of scholarly works according to the presented classification scheme.

Submitted 25 October, 2019; originally announced October 2019.

Comments: 16 pages, 4 tables

arXiv:1903.11682 [pdf]

From closed to open access: A case study of flipped journals

Authors: Fakhri Momeni, Nicholas Fraser, Isabella Peters, Philipp Mayr

Abstract: In recent years, increased stakeholder pressure to transition research to Open Access has led to many journals "flipping" from a toll access to an open access publishing model. Changing the publishing model can influence the decision of authors to submit their papers to a journal, and increased article accessibility may influence citation behaviour. The aim of this paper is to show changes in the number of published articles and citations after the flipping of a journal. We analysed a set of 171 journals in the Web of Science (WoS) which flipped to open access. In addition to comparing the number of articles, average relative citation (ARC) and normalized impact factor (IF) are applied, respectively, as bibliometric indicators at the article and journal level, to trace the transformation of flipped journals covered. Our results show that flipping mostly has had positive effects on the journal's IF, but it has had no obvious citation advantage for the articles. We also see a decline in the number of published articles after flipping. We can conclude that flipping to open access can improve the performance of journals, despite decreasing the tendency of authors to submit their articles and no better citation advantages for articles.

Submitted 9 October, 2019; v1 submitted 27 March, 2019; originally announced March 2019.

Comments: 6 pages, 4 figures, revised research-in-progress paper accepted at the 17th International Conference on Scientometrics & Informetrics (ISSI 2019), Rome, Italy

arXiv:1809.04570 [pdf, other]

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks

Authors: Michaela Blott, Thomas Preusser, Nicholas Fraser, Giulio Gambardella, Kenneth O'Brien, Yaman Umuroglu

Abstract: Convolutional Neural Networks have rapidly become the most successful machine learning algorithm, enabling ubiquitous machine vision and intelligent decisions on even embedded computing-systems. While the underlying arithmetic is structurally simple, compute and memory requirements are challenging. One of the promising opportunities is leveraging reduced-precision representations for inputs, activations and model parameters. The resulting scalability in performance, power efficiency and storage footprint provides interesting design compromises in exchange for a small reduction in accuracy. FPGAs are ideal for exploiting low-precision inference engines leveraging custom precisions to achieve the required numerical accuracy for a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool which enables design space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for given platforms, design targets and a specific precision. We introduce formalizations of resource cost functions and performance predictions, and elaborate on the optimization algorithms. Finally, we evaluate a selection of reduced precision neural networks ranging from CIFAR-10 classifiers to YOLO-based object detection on a range of platforms including PYNQ and AWS F1, demonstrating new unprecedented measured throughput at 50TOp/s on AWS-F1 and 5TOp/s on embedded devices.

Submitted 12 September, 2018; originally announced September 2018.

Comments: to be published in ACM TRETS Special Edition on Deep Learning

arXiv:1807.10577 [pdf, other]

Accuracy to Throughput Trade-offs for Reduced Precision Neural Networks on Reconfigurable Logic

Authors: Jiang Su, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Gianluca Durelli, David B. Thomas, Philip Leong, Peter Y. K. Cheung

Abstract: Modern CNNs are typically based on floating point linear algebra implementations. Recently, reduced precision NNs have been gaining popularity as they require significantly less memory and computational resources compared to floating point. This is particularly important in power constrained compute environments. However, in many cases a reduction in precision comes at a small cost to the accuracy of the resultant network. In this work, we investigate the accuracy-throughput trade-off for various parameter precisions applied to different types of NN models. We firstly propose a quantization training strategy that allows reduced precision NN inference with a lower memory footprint and competitive model accuracy. Then, we quantitatively formulate the relationship between data representation and hardware efficiency. Our experiments finally provide insightful observations. For example, one of our tests shows 32-bit floating point is more hardware efficient than 1-bit parameters to achieve 99% MNIST accuracy. In general, 2-bit and 4-bit fixed point parameters show better hardware trade-off on small-scale datasets like MNIST and CIFAR-10 while 4-bit provides the best trade-off in large-scale tasks like AlexNet on ImageNet dataset within our tested problem domain.

Submitted 17 July, 2018; originally announced July 2018.

Comments: Accepted by ARC 2018

arXiv:1807.03123 [pdf, other]

Scaling Neural Network Performance through Customized Hardware Architectures on Reconfigurable Logic

Authors: Michaela Blott, Thomas B. Preusser, Nicholas Fraser, Giulio Gambardella, Kenneth OBrien, Yaman Umuroglu, Miriam Leeser

Abstract: Convolutional Neural Networks have dramatically improved in recent years, surpassing human accuracy on certain problems and performance exceeding that of traditional computer vision algorithms. While the compute pattern in itself is relatively simple, significant compute and memory challenges remain as CNNs may contain millions of floating-point parameters and require billions of floating-point operations to process a single image. These computational requirements, combined with storage footprints that exceed typical cache sizes, pose a significant performance and power challenge for modern compute architectures. One of the promising opportunities to scale performance and power efficiency is leveraging reduced precision representations for all activations and weights as this allows to scale compute capabilities, reduce weight and feature map buffering requirements as well as energy consumption. While a small reduction in accuracy is encountered, these Quantized Neural Networks have been shown to achieve state-of-the-art accuracy on standard benchmark datasets, such as MNIST, CIFAR-10, SVHN and even ImageNet, and thus provide highly attractive design trade-offs. Current research has focused mainly on the implementation of extreme variants with full binarization of weights and/or activations, as well as typically smaller input images. Within this paper, we investigate the scalability of dataflow architectures with respect to supporting various precisions for both weights and activations, larger image dimensions, and increasing numbers of feature map channels. Key contributions are a formalized approach to understanding the scalability of the existing hardware architecture with cost models and a performance prediction as a function of the target device size. We provide validating experimental results for an ImageNet classification on a server-class platform, namely the AWS F1 node.

Submitted 26 June, 2018; originally announced July 2018.

arXiv:1807.00301 [pdf, other]

SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks

Authors: Julian Faraone, Nicholas Fraser, Michaela Blott, Philip H. W. Leong

Abstract: Inference for state-of-the-art deep neural networks is computationally expensive, making them difficult to deploy on constrained hardware environments. An efficient way to reduce this complexity is to quantize the weight parameters and/or activations during training by approximating their distributions with a limited entry codebook. For very low-precisions, such as binary or ternary networks with 1-8-bit activations, the information loss from quantization leads to significant accuracy degradation due to large gradient mismatches between the forward and backward functions. In this paper, we introduce a quantization method to reduce this loss by learning a symmetric codebook for particular weight subgroups. These subgroups are determined based on their locality in the weight matrix, such that the hardware simplicity of the low-precision representations is preserved. Empirically, we show that symmetric quantization can substantially improve accuracy for networks with extremely low-precision weights and activations. We also demonstrate that this representation imposes minimal or no hardware implications to more coarse-grained approaches. Source code is available at https://www.github.com/julianfaraone/SYQ.

Submitted 1 July, 2018; originally announced July 2018.

Comments: Published as a conference paper at the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

arXiv:1806.08085 [pdf, other]

doi 10.23919/DATE.2018.8342121

Inference of Quantized Neural Networks on Heterogeneous All-Programmable Devices

Authors: Thomas B. Preußer, Giulio Gambardella, Nicholas Fraser, Michaela Blott

Abstract: Neural networks have established themselves as a generic and powerful means to approach challenging problems such as image classification, object detection or decision making. Their successful employment foots on an enormous demand of compute. The quantization of network parameters and the processed data has proven a valuable measure to reduce the challenges of network inference so effectively that the feasible scope of applications is expanded even into the embedded domain. This paper describes the making of a real-time object detection in a live video stream processed on an embedded all-programmable device. The presented case illustrates how the required processing is tamed and parallelized across both the CPU cores and the programmable logic and how the most suitable resources and powerful extensions, such as NEON vectorization, are leveraged for the individual processing steps. The crafted result is an extended Darknet framework implementing a fully integrated, end-to-end solution from video capture over object annotation to video output applying neural network inference at different quantization levels running at 16 frames per second on an embedded Zynq UltraScale+ (XCZU3EG) platform.

Submitted 21 June, 2018; originally announced June 2018.

arXiv:1805.07941 [pdf, other]

Quantizing Convolutional Neural Networks for Low-Power High-Throughput Inference Engines

Authors: Sean O. Settle, Manasa Bollavaram, Paolo D'Alberto, Elliott Delaye, Oscar Fernandez, Nicholas Fraser, Aaron Ng, Ashish Sirasao, Michael Wu

Abstract: Deep learning as a means to inferencing has proliferated thanks to its versatility and ability to approach or exceed human-level accuracy. These computational models have seemingly insatiable appetites for computational resources not only while training, but also when deployed at scales ranging from data centers all the way down to embedded devices. As such, increasing consideration is being made to maximize the computational efficiency given limited hardware and energy resources and, as a result, inferencing with reduced precision has emerged as a viable alternative to the IEEE 754 Standard for Floating-Point Arithmetic. We propose a quantization scheme that allows inferencing to be carried out using arithmetic that is fundamentally more efficient when compared to even half-precision floating-point. Our quantization procedure is significant in that we determine our quantization scheme parameters by calibrating against its reference floating-point model using a single inference batch rather than (re)training and achieve end-to-end post quantization accuracies comparable to the reference model.

Submitted 21 May, 2018; originally announced May 2018.

arXiv:1709.06262 [pdf, other]

Compressing Low Precision Deep Neural Networks Using Sparsity-Induced Regularization in Ternary Networks

Authors: Julian Faraone, Nicholas Fraser, Giulio Gambardella, Michaela Blott, Philip H. W. Leong

Abstract: A low precision deep neural network training technique for producing sparse, ternary neural networks is presented. The technique incorporates hardware implementation costs during training to achieve significant model compression for inference. Training involves three stages: network training using L2 regularization and a quantization threshold regularizer, quantization pruning, and finally retraining. Resulting networks achieve improved accuracy, reduced memory footprint and reduced computational complexity compared with conventional methods, on MNIST and CIFAR10 datasets. Our networks are up to 98% sparse and 5 & 11 times smaller than equivalent binary and ternary models, translating to significant resource and speed benefits for hardware implementations.

Submitted 9 October, 2017; v1 submitted 19 September, 2017; originally announced September 2017.

Comments: To appear as a conference paper at the 24th International Conference On Neural Information Processing (ICONIP 2017)

arXiv:1701.03400 [pdf, other]

Scaling Binarized Neural Networks on Reconfigurable Logic

Authors: Nicholas J. Fraser, Yaman Umuroglu, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, Kees Vissers

Abstract: Binarized neural networks (BNNs) are gaining interest in the deep learning community due to their significantly lower computational and memory cost. They are particularly well suited to reconfigurable logic devices, which contain an abundance of fine-grained compute resources and can result in smaller, lower power implementations, or conversely in higher classification rates. Towards this end, the Finn framework was recently proposed for building fast and flexible field programmable gate array (FPGA) accelerators for BNNs. Finn utilized a novel set of optimizations that enable efficient mapping of BNNs to hardware and implemented fully connected, non-padded convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. However, FINN was not evaluated on larger topologies due to the size of the chosen FPGA, and exhibited decreased accuracy due to lack of padding. In this paper, we improve upon Finn to show how padding can be employed on BNNs while still maintaining a 1-bit datapath and high accuracy. Based on this technique, we demonstrate numerous experiments to illustrate flexibility and scalability of the approach. In particular, we show that a large BNN requiring 1.2 billion operations per frame running on an ADM-PCIE-8K5 platform can classify images at 12 kFPS with 671 us latency while drawing less than 41 W board power and classifying CIFAR-10 images at 88.7% accuracy. Our implementation of this network achieves 14.8 trillion operations per second. We believe this is the fastest classification rate reported to date on this benchmark at this level of accuracy.

Submitted 27 January, 2017; v1 submitted 12 January, 2017; originally announced January 2017.

Comments: To appear in the PARMA-DITAM workshop at HiPEAC 2017, January 2017

arXiv:1612.07119 [pdf, other]

doi 10.1145/3020078.3021744

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

Authors: Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, Kees Vissers

Abstract: Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 μs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 μs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.

Submitted 1 December, 2016; originally announced December 2016.

Comments: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017

arXiv:1104.5094 [pdf, ps, other]

doi 10.1088/0004-637X/735/2/85

OGLE-2005-BLG-018: Characterization of Full Physical and Orbital Parameters of a Gravitational Binary Lens

Authors: I. -G. Shin, A. Udalski, C. Han, A. Gould, M. Dominik, P. Fouque, M. Kubiak, M. K. Szymanski, G. Pietrzynki, I. Soszynski, K. Ulaczyk, L. Wyrzykowski, D. L. DePoy, S. Dong, B. S. Gaudi, C. -U. Lee, B. -G. Park, R. W. Pogge, M. D. Albrow, A. Allan, J. P. Beaulieu, D. P. Bennett, M. Bode, D. M. Bramich, S. Brillant, et al. (33 additional authors not shown)

Abstract: We present the analysis result of a gravitational binary-lensing event OGLE-2005-BLG-018. The light curve of the event is characterized by 2 adjacent strong features and a single weak feature separated from the strong features. The light curve exhibits noticeable deviations from the best-fit model based on standard binary parameters. To explain the deviation, we test models including various higher-order effects of the motions of the observer, source, and lens. From this, we find that it is necessary to account for the orbital motion of the lens in describing the light curve. From modeling of the light curve considering the parallax effect and Keplerian orbital motion, we are able to measure not only the physical parameters but also a complete orbital solution of the lens system. It is found that the event was produced by a binary lens located in the Galactic bulge with a distance $6.7\pm 0.3$ kpc from the Earth. The individual lens components with masses $0.9\pm 0.3\ M_\odot$ and $0.5\pm 0.1\ M_\odot$ are separated with a semi-major axis of $a=2.5 \pm 1.0$ AU and orbiting each other with a period $P=3.1 \pm 1.3$ yr. The event demonstrates that it is possible to extract detailed information about binary lens systems from well-resolved lensing light curves.

Submitted 27 April, 2011; originally announced April 2011.

Comments: 19 pages, 6 figures

arXiv:1005.0966 [pdf, ps, other]

doi 10.1051/0004-6361/201014053

OGLE 2008-BLG-290: An accurate measurement of the limb darkening of a Galactic Bulge K Giant spatially resolved by microlensing

Authors: P. Fouque, D. Heyrovsky, S. Dong, A. Gould, A. Udalski, M. D. Albrow, V. Batista, J. -P. Beaulieu, D. P. Bennett, I. A. Bond, D. M. Bramich, S. Calchi Novati, A. Cassan, C. Coutures, S. Dieters, M. Dominik, D. Dominis Prester, J. Greenhill, K. Horne, U. G. Jorgensen, S. Kozlowski, D. Kubas, C. -H. Lee, J. -B. Marquette, M. Mathiasen, et al. (93 additional authors not shown)

Abstract: Gravitational microlensing is not only a successful tool for discovering distant exoplanets, but it also enables characterization of the lens and source stars involved in the lensing event. In high magnification events, the lens caustic may cross over the source disk, which allows a determination of the angular size of the source and additionally a measurement of its limb darkening. When such extended-source effects appear close to maximum magnification, the resulting light curve differs from the characteristic Paczynski point-source curve. The exact shape of the light curve close to the peak depends on the limb darkening of the source. Dense photometric coverage permits measurement of the respective limb-darkening coefficients. In the case of microlensing event OGLE 2008-BLG-290, the K giant source star reached a peak magnification of about 100. Thirteen different telescopes have covered this event in eight different photometric bands. Subsequent light-curve analysis yielded measurements of linear limb-darkening coefficients of the source in six photometric bands. The best-measured coefficients lead to an estimate of the source effective temperature of about $4700^{+100}_{-200}$ K. However, the photometric estimate from colour-magnitude diagrams favours a cooler temperature of $4200 \pm 100$ K. As the limb-darkening measurements, at least in the CTIO/SMARTS2 V and I bands, are among the most accurate obtained, the above disagreement needs to be understood. A solution is proposed, which may apply to previous events where such a discrepancy also appeared.

Submitted 6 May, 2010; originally announced May 2010.

Comments: Astronomy & Astrophysics in press

arXiv:0801.2162 [pdf, ps, other]

doi 10.1002/asna.200710928

ARTEMiS (Automated Robotic Terrestrial Exoplanet Microlensing Search) - A possible expert-system based cooperative effort to hunt for planets of Earth mass and below

Authors: M. Dominik, K. Horne, A. Allan, N. J. Rattenbury, Y. Tsapras, C. Snodgrass, M. F. Bode, M. J. Burgdorf, S. N. Fraser, E. Kerins, C. J. Mottram, I. A. Steele, R. A. Street, P. J. Wheatley, L. Wyrzykowski

Abstract: (abridged) The technique of gravitational microlensing is currently unique in its ability to provide a sample of terrestrial exoplanets around both Galactic disk and bulge stars, allowing to measure their abundance and determine their distribution with respect to mass and orbital separation. In order to achieve these goals in reasonable time, a well-coordinated effort involving a network of either 2m or 4 x 1m telescopes at each site is required. It could lead to the first detection of an Earth-mass planet outside the Solar system, and even planets less massive than Earth could be discovered. From April 2008, ARTEMiS (Automated Robotic Terrestrial Exoplanet Microlensing Search) is planned to provide a platform for a three-step strategy of survey, follow-up, and anomaly monitoring. As an expert system embedded in eSTAR (e-Science Telescopes for Astronomical Research), ARTEMiS will give advice on the optimal targets to be observed at any given time, and will also alert on deviations from ordinary microlensing light curves by means of the SIGNALMEN anomaly detector. While the use of the VOEvent (Virtual Observatory Event) protocol allows a direct interaction with the telescopes that are part of the HTN (Heterogeneous Telescope Networks) consortium, additional interfaces provide means of communication with all existing microlensing campaigns that rely on human observers. The success of discovering a planet by microlensing critically depends on the availability of a telescope in a suitable location at the right time, which can mean within 10 min. Real-time modelling offers the opportunity of live discovery of extra-solar planets, thereby providing "Science live to your home".

Submitted 14 January, 2008; originally announced January 2008.

Comments: 4 pages with 2 eps figures embedded. Accepted for publication in Astronomische Nachrichten as part of the Proceedings of the Joint VOEvent & HTN Workshop "Hot-wiring the Transient Universe" held in Tucson, Arizona (US), June 4-7 2007

arXiv:astro-ph/0511032 [pdf, ps, other]

doi 10.1086/499289

The Automatic Real-Time GRB Pipeline of the 2-m Liverpool Telescope

Authors: C. Guidorzi, A. Monfardini, A. Gomboc, C. J. Mottram, C. G. Mundell, I. A. Steele, D. Carter, M. F. Bode, R. J. Smith, S. N. Fraser, M. J. Burgdorf, A. M. Newsam

Abstract: The 2-m Liverpool Telescope (LT), owned by Liverpool John Moores University, is located in La Palma (Canary Islands) and operates in fully robotic mode. In 2005, the LT began conducting an automatic GRB follow-up program. On receiving an automatic GRB alert from a Gamma-Ray Observatory (Swift, INTEGRAL, HETE-II, IPN) the LT initiates a special override mode that conducts follow-up observations within 2-3 min of the GRB onset. This follow-up procedure begins with an initial sequence of short (10-s) exposures acquired through an r' band filter. These images are reduced, analyzed and interpreted automatically using pipeline software developed by our team called "LT-TRAP" (Liverpool Telescope Transient Rapid Analysis Pipeline); the automatic detection and successful identification of an unknown and potentially fading optical transient triggers a subsequent multi-color imaging sequence. In the case of a candidate brighter than r'=15, either a polarimetric (from 2006) or a spectroscopic observation (from 2007) will be triggered on the LT. If no candidate is identified, the telescope continues to obtain z', r' and i' band imaging with increasingly longer exposure times. Here we present a detailed description of the LT-TRAP and briefly discuss the illustrative case of the afterglow of GRB 050502a, whose automatic identification by the LT just 3 min after the GRB, led to the acquisition of the first early-time (< 1 hr) multi-color light curve of a GRB afterglow.

Submitted 1 November, 2005; originally announced November 2005.

Comments: PASP, accepted (8 pages, 3 figures)

arXiv:astro-ph/0502506 [pdf, ps, other]

doi 10.1393/ncc/i2005-10140-3

The Liverpool Telescope Automatic Pipeline for Real-time GRB Afterglow Detection

Authors: A. Gomboc, A. Monfardini, C. Guidorzi, C. G. Mundell, C. J. Mottram, S. N. Fraser, R. J. Smith, I. A. Steele, D. Carter, M. F. Bode, A. M. Newsam

Abstract: The 2-m robotic Liverpool Telescope (LT) is ideally suited to the rapid follow-up of unpredictable and transient events such as GRBs. Our GRB follow-up strategy is designed to identify optical/IR counterparts in real time; it involves the automatic triggering of initial observations, on receipt of an alert from Gamma Ray Observatories HETE-2, INTEGRAL and Swift, followed by automated data reduction, analysis, OT identification and subsequent observing mode choice. The lack of human intervention in this process requires robustness at all stages of the procedure. Here we describe the telescope, its instrumentation and GRB pipeline.

Submitted 24 February, 2005; originally announced February 2005.

Comments: 4 pages, 1 figure, submitted to Il nuovo cimento (4th Workshop Gamma-Ray Bursts in the Afterglow Era, Rome, 18-22 October 2004)

Journal ref: Nuovo Cim.C28:727-730,2005

arXiv:astro-ph/0502505 [pdf, ps, other]

doi 10.1393/ncc/i2005-10139-8

Early GRB Optical and Infrared Afterglow Observations with the 2-m Robotic Liverpool Telescope

Authors: A. Gomboc, C. G. Mundell, C. Guidorzi, A. Monfardini, C. J. Mottram, R. Priddey, R. J. Smith, S. Pak, I. A. Steele, N. Tanvir, D. Carter, S. N. Fraser, M. F. Bode, A. M. Newsam, M. Hughes

Abstract: We present the first optical observations of a Gamma Ray Burst (GRB) afterglow using the 2-m robotic Liverpool Telescope (LT), which is owned and operated by Liverpool John Moores University and situated on La Palma. We briefly discuss the capabilities of LT and its suitability for rapid follow-up observations of early optical and infrared GRB light curves. In particular, the combination of aperture, site, instrumentation and rapid response (robotic over-ride mode aided by telescope's rapid slew and fully-opening enclosure) makes the LT ideal for investigating the nature of short bursts, optically-dark bursts, and GRB blast-wave physics in general. We briefly describe the LT's key position in the RoboNet-1.0 network of robotic telescopes. We present the LT observations of GRB041006 and use its gamma-ray properties to predict the time of the break in optical light curve, a prediction consistent with the observations.

Submitted 3 May, 2005; v1 submitted 24 February, 2005; originally announced February 2005.

Comments: 4 pages, 1 figure, accepted for publication in Il nuovo cimento (4th Workshop Gamma-Ray Bursts in the Afterglow Era, Rome, 18-22 October 2004)

Journal ref: Nuovo Cim.C28:723-726,2005

Showing 1–24 of 24 results for author: Fraser, N