JugglePAC: a Pipelined Accumulation Circuit
Authors:
Ahmad Houraniah,
H. Fatih Ugurdag,
Furkan Aydin
Abstract:
Reducing a set of numbers to a single value is a fundamental operation in applications such as signal processing, data compression, scientific computing, and neural networks. Accumulation, which involves summing a dataset to obtain a single result, is crucial for these tasks. Due to hardware constraints, large vectors or matrices often cannot be fully stored in memory and must be read sequentially, one item per clock cycle. For high-speed inputs, such as rapidly arriving floating-point numbers, pipelined adders are necessary to maintain performance. However, pipelining introduces multiple intermediate sums and requires delays between back-to-back datasets unless their processing is overlapped. In this paper, we present JugglePAC, a novel accumulation circuit designed to address these challenges. JugglePAC operates quickly, is area-efficient, and features a fully pipelined design. It effectively manages back-to-back variable-length datasets while consistently producing results in the correct input order. Compared to the state-of-the-art, JugglePAC achieves higher throughput and reduces area complexity, offering significant improvements in performance and efficiency.
Submitted 16 September, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
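The JugglePAC abstract above notes that a pipelined adder leaves multiple intermediate sums in flight. The following is a minimal behavioral sketch of that effect only, not the JugglePAC circuit itself: it assumes one input per clock cycle and an adder whose result returns after `adder_latency` cycles, so each input can only be added to the partial sum that re-emerges in that cycle. The function name and the `adder_latency` parameter are hypothetical, chosen for illustration.

```python
def pipelined_partial_sums(values, adder_latency):
    """Model the interleaved partial sums left by a pipelined adder.

    Assumption: an input arriving in cycle c is combined with the partial
    sum that left the adder pipeline in that same cycle, i.e. the one in
    slot c % adder_latency. This is a software model, not a circuit.
    """
    partial = [0] * adder_latency
    for cycle, value in enumerate(values):
        partial[cycle % adder_latency] += value
    return partial


# Seven inputs through a 3-stage adder leave three separate partial sums.
partials = pipelined_partial_sums([1, 2, 3, 4, 5, 6, 7], adder_latency=3)
print(partials)       # [12, 7, 9]
print(sum(partials))  # 28, the final value after the partial sums are merged
```

Merging these leftover partial sums while the next dataset is already arriving is the scheduling problem the abstract describes; how JugglePAC solves it is not stated in the abstract and is not modeled here.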
Efficient Multi-Cycle Folded Integer Multipliers
Authors:
Ahmad Houraniah,
H. Fatih Ugurdag,
C. Emre Dedeagac
Abstract:
Fast combinational multipliers with large bit widths can occupy significant silicon area, which also drives up power consumption. Area can be reduced through resource sharing (i.e., folding) at the expense of lower throughput, which is acceptable for some applications. This work explores multiple architectures for Multi-Cycle folded Integer Multiplier (MCIM) designs, which are based on Schoolbook and Karatsuba approaches. Applications sometimes require a fractional number of multiplications to be performed per cycle. For example, an algorithm may only require 3.5 multiplications per cycle. In such a case, 3 multipliers with a throughput of 1 plus an additional smaller multiplier with a throughput of $1/2$ would be sufficient to maintain the algorithm's throughput. Our MCIM design generator offers customization in terms of throughput, latency, and clock frequency. MCIM designs were synthesized and verified for various parameter values using scripts. ASIC synthesis results show that MCIM designs with a throughput of $1/2$ offer area savings of up to 44% for bit widths of 8 to 128 with respect to directly synthesizing the * operator. Additionally, MCIM designs can offer up to 33% energy savings and 65% average peak power reduction.
Submitted 20 March, 2025; v1 submitted 30 January, 2023;
originally announced January 2023.
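The MCIM abstract above names the Schoolbook and Karatsuba approaches. As a rough illustration of why they suit folding, the sketch below splits each unsigned operand once into two halves: the schoolbook form needs four half-width products, the Karatsuba form three, and a folded design can time-multiplex such sub-products onto a smaller multiplier over several cycles. This is plain software arithmetic under those textbook identities, not the MCIM generator; the function names and the `half_bits` parameter are hypothetical.

```python
def schoolbook_split(a, b, half_bits):
    """One-level schoolbook split: four half-width products."""
    mask = (1 << half_bits) - 1
    a_hi, a_lo = a >> half_bits, a & mask
    b_hi, b_lo = b >> half_bits, b & mask
    return ((a_hi * b_hi) << (2 * half_bits)) \
        + ((a_hi * b_lo + a_lo * b_hi) << half_bits) \
        + (a_lo * b_lo)


def karatsuba_split(a, b, half_bits):
    """One-level Karatsuba split: three products, extra additions."""
    mask = (1 << half_bits) - 1
    a_hi, a_lo = a >> half_bits, a & mask
    b_hi, b_lo = b >> half_bits, b & mask
    z2 = a_hi * b_hi
    z0 = a_lo * b_lo
    # The middle term reuses z2 and z0 instead of two more products.
    z1 = (a_hi + a_lo) * (b_hi + b_lo) - z2 - z0
    return (z2 << (2 * half_bits)) + (z1 << half_bits) + z0


a, b = 0xABCD, 0x1234
assert schoolbook_split(a, b, 8) == a * b
assert karatsuba_split(a, b, 8) == a * b
```

Both functions agree with the built-in `*` operator for unsigned inputs; the hardware trade-off (fewer multipliers versus extra adders and a slightly wider middle product) is only hinted at by this software model.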