Search | arXiv e-print repository

Improving Quantization with Post-Training Model Expansion

Authors: Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott

Abstract: The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically focused on reducing the overall volume of pre-trained models to reduce inference costs while maintaining model quality. However, recent advancements have introduced optimization techniques that, interestingly, expand models post-training, increasing model size to improve quality when reducing volume. For instance, to enable 4-bit weight and activation quantization, incoherence processing often necessitates inserting online Hadamard rotations in the compute graph, and preserving highly sensitive weights often calls for additional higher precision computations. However, if application requirements cannot be met, the prevailing solution is to relax quantization constraints. In contrast, we demonstrate post-training model expansion is a viable strategy to improve model quality within a quantization co-design space, and provide theoretical justification. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining. In particular, when quantizing the weights and activations to 4 bits for Llama3 1B, we reduce the zero-shot accuracy gap to full precision by an average of 3% relative to both QuaRot and SpinQuant with only 5% more parameters, which is still a 3.8% reduction in volume relative to a BF16 reference model.

Submitted 21 March, 2025; originally announced March 2025.

arXiv:2410.00340 [pdf, other]

Sparse Attention Decomposition Applied to Circuit Tracing

Authors: Gabriel Franco, Mark Crovella

Abstract: Many papers have shown that attention heads work in conjunction with each other to perform complex tasks. It's frequently assumed that communication between attention heads is via the addition of specific features to token residuals. In this work we seek to isolate and identify the features used to effect communication and coordination among attention heads in GPT-2 small. Our key leverage on the problem is to show that these features are very often sparsely coded in the singular vectors of attention head matrices. We characterize the dimensionality and occurrence of these signals across the attention heads in GPT-2 small when used for the Indirect Object Identification (IOI) task. The sparse encoding of signals, as provided by attention head singular vectors, allows for efficient separation of signals from the residual background and straightforward identification of communication paths between attention heads. We explore the effectiveness of this approach by tracing portions of the circuits used in the IOI task. Our traces reveal considerable detail not present in previous studies, shedding light on the nature of redundant paths present in GPT-2. And our traces go beyond previous work by identifying features used to communicate between attention heads when performing IOI.

Submitted 28 October, 2024; v1 submitted 30 September, 2024; originally announced October 2024.

arXiv:2409.17092 [pdf, other]

Accumulator-Aware Post-Training Quantization

Authors: Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab

Abstract: Several recent studies have investigated low-precision accumulation, reporting improvements in throughput, power, and area across various platforms. However, the accompanying proposals have only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To the best of our knowledge, ours marks the first formal study of accumulator-aware quantization in the PTQ setting. To bridge this gap, we introduce AXE, a practical framework of accumulator-aware extensions designed to endow overflow avoidance guarantees to existing layer-wise PTQ algorithms. We theoretically motivate AXE and demonstrate its flexibility by implementing it on top of two state-of-the-art PTQ algorithms: GPFQ and OPTQ. We further generalize AXE to support multi-stage accumulation for the first time, opening the door for full datapath optimization and scaling to large language models (LLMs). We evaluate AXE across image classification and language generation models, and observe significant improvements in the trade-off between accumulator bit width and model accuracy over baseline methods.

Submitted 25 September, 2024; originally announced September 2024.

arXiv:2407.14538 [pdf, other]

Alea-BFT: Practical Asynchronous Byzantine Fault Tolerance

Authors: Diogo S. Antunes, Afonso N. Oliveira, André Breda, Matheus Guilherme Franco, Henrique Moniz, Rodrigo Rodrigues

Abstract: Traditional Byzantine Fault Tolerance (BFT) state machine replication protocols assume a partial synchrony model, leading to a design where a leader replica drives the protocol and is replaced after a timeout. Recently, we witnessed a surge of asynchronous BFT protocols, which use randomization to remove the need for bounds on message delivery times, making them more resilient to adverse network conditions. However, existing research proposals still fall short of gaining practical adoption, plausibly because they are not able to combine good performance with a simple design that can be readily understood and adopted. In this paper, we present Alea-BFT, a simple and highly efficient asynchronous BFT protocol, which is gaining practical adoption, namely in Ethereum distributed validators. Alea-BFT brings the key design insight from classical protocols of concentrating part of the work on a single designated replica and incorporates this principle in a simple two-stage pipelined design, with an efficient broadcast led by the designated replica, followed by an inexpensive binary agreement. The evaluation of our research prototype implementation and two real-world integrations in cryptocurrency ecosystems shows excellent performance, improving on the fastest protocol (Dumbo-NG) in terms of latency and displaying good performance under faults.

Submitted 14 July, 2024; originally announced July 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2202.02071

ACM Class: C.2.4; D.4.5

Journal ref: In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24) 2024 (pp. 313-328)

arXiv:2311.12359 [pdf, other]

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Authors: Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

Abstract: Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their relative comparison in terms of accuracy-hardware cost with integers remain unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.

Submitted 5 July, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: Accepted in FPL (International Conference on Field-Programmable Logic and Applications) 2024 conference. Revised with updated results

arXiv:2310.19065 [pdf, other]

Evaluating LLP Methods: Challenges and Approaches

Authors: Gabriel Franco, Giovanni Comarela, Mark Crovella

Abstract: Learning from Label Proportions (LLP) is an established machine learning problem with numerous real-world applications. In this setting, data items are grouped into bags, and the goal is to learn individual item labels, knowing only the features of the data and the proportions of labels in each bag. Although LLP is a well-established problem, it has several unusual aspects that create challenges for benchmarking learning methods. Fundamental complications arise because of the existence of different LLP variants, i.e., dependence structures that can exist between items, labels, and bags. Accordingly, the first algorithmic challenge is the generation of variant-specific datasets capturing the diversity of dependence structures and bag characteristics. The second methodological challenge is model selection, i.e., hyperparameter tuning; due to the nature of LLP, model selection cannot easily use the standard machine learning paradigm. The final benchmarking challenge consists of properly evaluating LLP solution methods across various LLP variants. We note that there is very little consideration of these issues in prior work, and there are no general solutions for these challenges proposed to date. To address these challenges, we develop methods capable of generating LLP datasets meeting the requirements of different variants. We use these methods to generate a collection of datasets encompassing the spectrum of LLP problem characteristics, which can be used in future evaluation studies. Additionally, we develop guidelines for benchmarking LLP algorithms, including the model selection and evaluation steps. Finally, we illustrate the new methods and guidelines by performing an extensive benchmark of a set of well-known LLP algorithms. We show that choosing the best algorithm depends critically on the LLP variant and model selection method, demonstrating the need for our proposed approach.

Submitted 29 October, 2023; originally announced October 2023.

arXiv:2304.01405 [pdf]

The Work Avatar Face-Off: Knowledge Worker Preferences for Realism in Meetings

Authors: Vrushank Phadnis, Kristin Moore, Mar Gonzalez Franco

Abstract: While avatars have grown in popularity in social settings, their use in the workplace is still debatable. We conducted a large-scale survey to evaluate knowledge worker sentiment towards avatars, particularly the effects of realism on their acceptability for work meetings. Our survey of 2509 knowledge workers from multiple countries rated five avatar styles for use by managers, known colleagues and unknown colleagues. In all scenarios, participants favored higher realism, but fully realistic avatars were sometimes perceived as uncanny. Less realistic avatars were rated worse when interacting with an unknown colleague or manager, as compared to a known colleague. Avatar acceptability varied by country, with participants from the United States and South Korea rating avatars more favorably. We supplemented our quantitative findings with a thematic analysis of open-ended responses to provide a comprehensive understanding of factors influencing work avatar choices. In conclusion, our results show that realism had a significant positive correlation with acceptability. Non-realistic avatars were seen as fun and playful, but only suitable for occasional use.

Submitted 8 October, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: 10 pages, accepted at ISMAR 2023 conference

arXiv:2106.15351 [pdf, ps, other]

Spectral concepts in genome informational analysis

Authors: Vincenzo Bonnici, Giuditta Franco, Vincenzo Manca

Abstract: The concept of k-spectrum for genomes is here investigated as a basic tool to analyze genomes. Related spectral notions based on k-mers are introduced with some related mathematical properties which are relevant for informational analysis of genomes. Procedures to generate spectral segmentations of genomes are provided and are tested (under several values of length k for k-mers) on cases of real genomes, such as some human chromosomes and Saccharomyces cerevisiae.

Submitted 25 June, 2021; originally announced June 2021.

arXiv:2009.10449 [pdf, other]

A word recurrence based algorithm to extract genomic dictionaries

Authors: Vincenzo Bonnici, Giuditta Franco, Vincenzo Manca

Abstract: Genomes may be analyzed from an information viewpoint as very long strings, containing functional elements of variable length, which have been assembled by evolution. In this work an innovative information theory based algorithm is proposed, to extract significant (relatively small) dictionaries of genomic words. Namely, conceptual analyses are here combined with empirical studies, to open up a methodology for the extraction of variable length dictionaries from genomic sequences, based on the information content of some factors. Its application to human chromosomes highlights an original inter-chromosomal similarity in terms of factor distributions.

Submitted 22 September, 2020; originally announced September 2020.

arXiv:1908.06399 [pdf]

Evaluation of an AI System for the Detection of Diabetic Retinopathy from Images Captured with a Handheld Portable Fundus Camera: the MAILOR AI study

Authors: T W Rogers, J Gonzalez-Bueno, R Garcia Franco, E Lopez Star, D Méndez Marín, J Vassallo, V C Lansingh, S Trikha, N Jaccard

Abstract: Objectives: To evaluate the performance of an Artificial Intelligence (AI) system (Pegasus, Visulytix Ltd., UK), at the detection of Diabetic Retinopathy (DR) from images captured by a handheld portable fundus camera. Methods: A cohort of 6,404 patients (~80% with diabetes mellitus) was screened for retinal diseases using a handheld portable fundus camera (Pictor Plus, Volk Optical Inc., USA) at the Mexican Advanced Imaging Laboratory for Ocular Research. The images were graded for DR by specialists according to the Scottish DR grading scheme. The performance of the AI system was evaluated, retrospectively, in assessing Referable DR (RDR) and Proliferative DR (PDR) and compared to the performance on a publicly available desktop camera benchmark dataset. Results: For RDR detection, Pegasus performed with an 89.4% (95% CI: 88.0-90.7) Area Under the Receiver Operating Characteristic (AUROC) curve for the MAILOR cohort, compared to an AUROC of 98.5% (95% CI: 97.8-99.2) on the benchmark dataset. This difference was statistically significant. Moreover, no statistically significant difference was found in performance for PDR detection with Pegasus achieving an AUROC of 94.3% (95% CI: 91.0-96.9) on the MAILOR cohort and 92.2% (95% CI: 89.4-94.8) on the benchmark dataset. Conclusions: Pegasus showed good transferability for the detection of PDR from a curated desktop fundus camera dataset to real-world clinical practice with a handheld portable fundus camera. However, there was a substantial, and statistically significant, decrease in the diagnostic performance for RDR when using the handheld device.

Submitted 18 August, 2019; originally announced August 2019.

Showing 1–10 of 10 results for author: Franco, G