-
Measuring Participant Contributions in Decentralized Federated Learning
Authors:
Honoka Anada,
Tatsuya Kaneko,
Shinya Takamaeda-Yamazaki
Abstract:
Federated learning (FL) enables multiple clients to collaboratively train models without sharing their data. Measuring participant contributions in FL is crucial for incentivizing clients and ensuring transparency. While various methods have been proposed for contribution measurement, they are designed exclusively for centralized federated learning (CFL), where a central server collects and aggreg…
▽ More
Federated learning (FL) enables multiple clients to collaboratively train models without sharing their data. Measuring participant contributions in FL is crucial for incentivizing clients and ensuring transparency. While various methods have been proposed for contribution measurement, they are designed exclusively for centralized federated learning (CFL), where a central server collects and aggregates client models, along with evaluating their contributions. Meanwhile, decentralized federated learning (DFL), in which clients exchange models directly without a central server, has gained significant attention for mitigating communication bottlenecks and eliminating a single point of failure. However, applying existing contribution measurement methods to DFL is challenging due to the presence of multiple global models and the absence of a central server. In this study, we present novel methodologies for measuring participant contributions in DFL. We first propose DFL-Shapley, an extension of the Shapley value tailored for DFL, adapting this widely used CFL metric to decentralized settings. Given the impracticality of computing the ideal DFL-Shapley in real-world systems, we introduce DFL-MR, a computable approximation that estimates overall contributions by accumulating round-wise Shapley values. We evaluate DFL-Shapley and DFL-MR across various FL scenarios and compare them with existing CFL metrics. The experimental results confirm DFL-Shapley as a valid ground-truth metric and demonstrate DFL-MR's proximity to DFL-Shapley across various settings, highlighting their effectiveness as contribution metrics in DFL.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
PUDTune: Multi-Level Charging for High-Precision Calibration in Processing-Using-DRAM
Authors:
Tatsuya Kubo,
Daichi Tokuda,
Lei Qu,
Ting Cao,
Shinya Takamaeda-Yamazaki
Abstract:
Recently, practical analog in-memory computing has been realized using unmodified commercial DRAM modules. The underlying Processing-Using-DRAM (PUD) techniques enable high-throughput bitwise operations directly within DRAM arrays. However, the presence of inherent error-prone columns hinders PUD's practical adoption. While selectively using only error-free columns would ensure reliability, this a…
▽ More
Recently, practical analog in-memory computing has been realized using unmodified commercial DRAM modules. The underlying Processing-Using-DRAM (PUD) techniques enable high-throughput bitwise operations directly within DRAM arrays. However, the presence of inherent error-prone columns hinders PUD's practical adoption. While selectively using only error-free columns would ensure reliability, this approach significantly reduces PUD's computational throughput.
This paper presents PUDTune, a novel high-precision calibration technique for increasing the number of error-free columns in PUD. PUDTune compensates for errors by applying pre-identified column-specific offsets to PUD operations. By leveraging multi-level charge states of DRAM cells, PUDTune generates fine-grained and wide-range offset variations despite the limited available rows. Our experiments with DDR4 DRAM demonstrate that PUDTune increases the number of error-free columns by 1.81$\times$ compared to conventional implementations, improving addition and multiplication throughput by 1.88$\times$ and 1.89$\times$ respectively.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration
Authors:
Tatsuya Kubo,
Daichi Tokuda,
Tomoya Nagatani,
Masayuki Usui,
Lei Qu,
Ting Cao,
Shinya Takamaeda-Yamazaki
Abstract:
General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the potential to repurpose on-device DRAM as a GeMV engine, offering additional high-throughput processing capabilities to widespread consumer devices without DRAM modif…
▽ More
General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the potential to repurpose on-device DRAM as a GeMV engine, offering additional high-throughput processing capabilities to widespread consumer devices without DRAM modifications. However, applying PUD to GeMV operations in the LLM inference pipeline incurs significant overheads $\textit{before}$ and $\textit{after}$ in-DRAM computation, diminishing the benefits of its high-throughput processing capabilities.
This paper presents MVDRAM, the first practical system to accelerate GeMV operations for low-bit LLM inference using unmodified DRAM. By leveraging the data sharing patterns and mathematical linearity in GeMV operations, MVDRAM orchestrates the processor and DRAM to eliminate the costs associated with pre-arranging inputs and bit-transposition of outputs required in conventional PUD approaches. Our experimental evaluation with four DDR4 DRAM modules shows that MVDRAM achieves comparable or even better inference speed than the processor-based implementation for GeMV operations in low-bit (under 4-bit) LLM. In particular, MVDRAM achieves up to 7.29$\times$ speedup and 30.5$\times$ energy efficiency for low-bit GeMV operations. For end-to-end LLM inference, MVDRAM achieves 2.18$\times$ and 1.31$\times$ throughput improvements, along with 3.04$\times$ and 2.35$\times$ energy efficiency, for 2-bit and 4-bit quantized low-bit models, respectively. MVDRAM has the potential to redefine the AI hardware landscape by demonstrating the feasibility of standard DRAM as an LLM accelerator.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
PRIOT: Pruning-Based Integer-Only Transfer Learning for Embedded Systems
Authors:
Honoka Anada,
Sefutsu Ryu,
Masayuki Usui,
Tatsuya Kaneko,
Shinya Takamaeda-Yamazaki
Abstract:
On-device transfer learning is crucial for adapting a common backbone model to the unique environment of each edge device. Tiny microcontrollers, such as the Raspberry Pi Pico, are key targets for on-device learning but often lack floating-point units, necessitating integer-only training. Dynamic computation of quantization scale factors, which is adopted in former studies, incurs high computation…
▽ More
On-device transfer learning is crucial for adapting a common backbone model to the unique environment of each edge device. Tiny microcontrollers, such as the Raspberry Pi Pico, are key targets for on-device learning but often lack floating-point units, necessitating integer-only training. Dynamic computation of quantization scale factors, which is adopted in former studies, incurs high computational costs. Therefore, this study focuses on integer-only training with static scale factors, which is challenging with existing training methods. We propose a new training method named PRIOT, which optimizes the network by pruning selected edges rather than updating weights, allowing effective training with static scale factors. The pruning pattern is determined by the edge-popup algorithm, which trains a parameter named score assigned to each edge instead of the original parameters and prunes the edges with low scores before inference. Additionally, we introduce a memory-efficient variant, PRIOT-S, which only assigns scores to a small fraction of edges. We implement PRIOT and PRIOT-S on the Raspberry Pi Pico and evaluate their accuracy and computational costs using a tiny CNN model on the rotated MNIST dataset and the VGG11 model on the rotated CIFAR-10 dataset. Our results demonstrate that PRIOT improves accuracy by 8.08 to 33.75 percentage points over existing methods, while PRIOT-S reduces memory footprint with minimal accuracy loss.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
RustSFQ: A Domain-Specific Language for SFQ Circuit Design
Authors:
Mebuki Oishi,
Sun Tanaka,
Shinya Takamaeda-Yamazaki
Abstract:
Cell-based design of a single-flux-quantum (SFQ) digital circuit requires input-output consistency; every output signal must be consumed only once by the input of the following component, which is a unique constraint, unlike the traditional CMOS digital circuit design. While there are some cell libraries and simulation tools for SFQ circuit development, they do not verify the input-output consiste…
▽ More
Cell-based design of a single-flux-quantum (SFQ) digital circuit requires input-output consistency; every output signal must be consumed only once by the input of the following component, which is a unique constraint, unlike the traditional CMOS digital circuit design. While there are some cell libraries and simulation tools for SFQ circuit development, they do not verify the input-output consistency, and designers have significant responsibilities to ensure it manually. Additionally, designers have to carefully manage net names without unintended duplication and correct connectivity among nets in a netlist for simulations. We propose RustSFQ, a domain-specific language (DSL) embedded in Rust that automatically ensures the input-output consistency in the SFQ circuit by leveraging the ownership system of Rust. Each SFQ circuit element is represented as a function while wires are represented as instances, and the Rust compiler verifies that multiple elements do not share a single wire through the ownership system. Circuit descriptions in the RustSFQ are successfully compiled into low-level netlists for both analog and digital circuit simulations, and the DSL provides higher productivity than the conventional design flow. Using the RustSFQ, we developed an SFQ-based Reed-Solomon encoder with a 4-bit width for the first time as a case study. We confirmed that the circuit operated correctly at 10 GHz through circuit simulations.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Exploring the Versal AI Engine for 3D Gaussian Splatting
Authors:
Kotaro Shimamura,
Ayumi Ohno,
Shinya Takamaeda-Yamazaki
Abstract:
Dataflow-oriented spatial architectures are the emerging paradigm for higher computation performance and efficiency.
AMD Versal AI Engine is a commercial spatial architecture consisting of tiles of VLIW processors supporting SIMD operations arranged in a two-dimensional mesh.
The architecture requires the explicit design of task assignments and dataflow configurations for each tile to maximize…
▽ More
Dataflow-oriented spatial architectures are the emerging paradigm for higher computation performance and efficiency.
AMD Versal AI Engine is a commercial spatial architecture consisting of tiles of VLIW processors supporting SIMD operations arranged in a two-dimensional mesh.
The architecture requires the explicit design of task assignments and dataflow configurations for each tile to maximize performance, demanding advanced techniques and meticulous design.
However, a few works revealed the performance characteristics of the Versal AI Engine through practical workloads.
In this work, we provide the comprehensive performance evaluation of the Versal AI Engine using Gaussian feature computation in 3D Gaussian splatting as a practical workload, and we then propose a novel dedicated algorithm to fully exploit the hardware architecture.
The computations of 3D Gaussian splatting include matrix multiplications and color computations utilizing high-dimensional spherical harmonic coefficients.
These tasks are processed efficiently by leveraging the SIMD capabilities and their instruction-level parallelism.
Additionally, pipelined processing is achieved by assigning different tasks to individual cores, thereby fully exploiting the spatial parallelism of AI Engines.
The proposed method demonstrated a 226-fold throughput increase in simulation-based evaluation, outperforming a naive approach.
These findings provide valuable insights for application development that effectively harnesses the spatial and architectural advantages of AI Engines.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Accelerating Elliptic Curve Point Additions on Versal AI Engine for Multi-scalar Multiplication
Authors:
Ayumi Ohno,
Kotaro Shimamura,
Shinya Takamaeda-Yamazaki
Abstract:
Multi-scalar multiplication (MSM) is crucial in cryptographic applications and computationally intensive in zero-knowledge proofs. MSM involves accumulating the products of scalars and points on an elliptic curve over a 377-bit modulus, and the Pippenger algorithm converts MSM into a series of elliptic curve point additions (PADDs) with high parallelism. This study investigates accelerating MSM on…
▽ More
Multi-scalar multiplication (MSM) is crucial in cryptographic applications and computationally intensive in zero-knowledge proofs. MSM involves accumulating the products of scalars and points on an elliptic curve over a 377-bit modulus, and the Pippenger algorithm converts MSM into a series of elliptic curve point additions (PADDs) with high parallelism. This study investigates accelerating MSM on the Versal ACAP platform, an emerging hardware that employs a spatial architecture integrating 400 AI Engines (AIEs) with programmable logic and a processing system. AIEs are SIMD-based VLIW processors capable of performing vector multiply-accumulate operations, making them well-suited for multiplication-heavy workloads in PADD. Unlike simpler multiplication tasks in previous studies, cryptographic computations also require complex operations such as carry propagation. These operations necessitate architecture-aware optimizations, including intra-core dedicated coding style to fully exploit VLIW capabilities and inter-core strategy for spatial task mapping. We propose various optimizations to accelerate PADDs, including (1) algorithmic optimizations for carry propagation employing a carry-save-like technique to exploit VLIW and SIMD capabilities and (2) a comparison of four distinct spatial mappings to enhance intra- and inter-task parallelism. Our approach achieves a computational efficiency that utilizes 50.2% of the theoretical memory bandwidth and provides 568 speedup over the integrated CPU on the AIE evaluation board.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Federated Learning with Relative Fairness
Authors:
Shogo Nakakita,
Tatsuya Kaneko,
Shinya Takamaeda-Yamazaki,
Masaaki Imaizumi
Abstract:
This paper proposes a federated learning framework designed to achieve \textit{relative fairness} for clients. Traditional federated learning frameworks typically ensure absolute fairness by guaranteeing minimum performance across all client subgroups. However, this approach overlooks disparities in model performance between subgroups. The proposed framework uses a minimax problem approach to mini…
▽ More
This paper proposes a federated learning framework designed to achieve \textit{relative fairness} for clients. Traditional federated learning frameworks typically ensure absolute fairness by guaranteeing minimum performance across all client subgroups. However, this approach overlooks disparities in model performance between subgroups. The proposed framework uses a minimax problem approach to minimize relative unfairness, extending previous methods in distributionally robust optimization (DRO). A novel fairness index, based on the ratio between large and small losses among clients, is introduced, allowing the framework to assess and improve the relative fairness of trained models. Theoretical guarantees demonstrate that the framework consistently reduces unfairness. We also develop an algorithm, named \textsc{Scaff-PD-IA}, which balances communication and computational efficiency while maintaining minimax-optimal convergence rates. Empirical evaluations on real-world datasets confirm its effectiveness in maintaining model performance while reducing disparity.
△ Less
Submitted 2 November, 2024;
originally announced November 2024.
-
PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation
Authors:
Wenlun Zhang,
Shimpei Ando,
Yung-Chin Chen,
Satomi Miyagi,
Shinya Takamaeda-Yamazaki,
Kentaro Yoshioka
Abstract:
Approximate computing emerges as a promising approach to enhance the efficiency of compute-in-memory (CiM) systems in deep neural network processing. However, traditional approximate techniques often significantly trade off accuracy for power efficiency, and fail to reduce data transfer between main memory and CiM banks, which dominates power consumption. This paper introduces a novel probabilisti…
▽ More
Approximate computing emerges as a promising approach to enhance the efficiency of compute-in-memory (CiM) systems in deep neural network processing. However, traditional approximate techniques often significantly trade off accuracy for power efficiency, and fail to reduce data transfer between main memory and CiM banks, which dominates power consumption. This paper introduces a novel probabilistic approximate computation (PAC) method that leverages statistical techniques to approximate multiply-and-accumulation (MAC) operations, reducing approximation error by 4X compared to existing approaches. PAC enables efficient sparsity-based computation in CiM systems by simplifying complex MAC vector computations into scalar calculations. Moreover, PAC enables sparsity encoding and eliminates the LSB activations transmission, significantly reducing data reads and writes. This sets PAC apart from traditional approximate computing techniques, minimizing not only computation power but also memory accesses by 50%, thereby boosting system-level efficiency. We developed PACiM, a sparsity-centric architecture that fully exploits sparsity to reduce bit-serial cycles by 81% and achieves a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS while maintaining high accuracy of 93.85/72.36/66.02% on CIFAR-10/CIFAR-100/ImageNet benchmarks using a ResNet-18 model, demonstrating the effectiveness of our PAC methodology.
△ Less
Submitted 28 August, 2024;
originally announced August 2024.
-
HALO-CAT: A Hidden Network Processor with Activation-Localized CIM Architecture and Layer-Penetrative Tiling
Authors:
Yung-Chin Chen,
Shimpei Ando,
Daichi Fujiki,
Shinya Takamaeda-Yamazaki,
Kentaro Yoshioka
Abstract:
To address the 'memory wall' problem in NN hardware acceleration, we introduce HALO-CAT, a software-hardware co-design optimized for Hidden Neural Network (HNN) processing. HALO-CAT integrates Layer-Penetrative Tiling (LPT) for algorithmic efficiency, reducing intermediate result sizes. Furthermore, the architecture employs an activation-localized computing-in-memory approach to minimize data move…
▽ More
To address the 'memory wall' problem in NN hardware acceleration, we introduce HALO-CAT, a software-hardware co-design optimized for Hidden Neural Network (HNN) processing. HALO-CAT integrates Layer-Penetrative Tiling (LPT) for algorithmic efficiency, reducing intermediate result sizes. Furthermore, the architecture employs an activation-localized computing-in-memory approach to minimize data movement. This design significantly enhances energy efficiency, achieving a 14.2x reduction in activation memory capacity and a 17.8x decrease in energy consumption, with only a 1.5% loss in accuracy, compared to traditional HNN processors.
△ Less
Submitted 10 December, 2023;
originally announced December 2023.
-
OSA-HCIM: On-The-Fly Saliency-Aware Hybrid SRAM CIM with Dynamic Precision Configuration
Authors:
Yung-Chin Chen,
Shimpei Ando,
Daichi Fujiki,
Shinya Takamaeda-Yamazaki,
Kentaro Yoshioka
Abstract:
Computing-in-Memory (CIM) has shown great potential for enhancing efficiency and performance for deep neural networks (DNNs). However, the lack of flexibility in CIM leads to an unnecessary expenditure of computational resources on less critical operations, and a diminished Signal-to-Noise Ratio (SNR) when handling more complex tasks, significantly hindering the overall performance. Hence, we focu…
▽ More
Computing-in-Memory (CIM) has shown great potential for enhancing efficiency and performance for deep neural networks (DNNs). However, the lack of flexibility in CIM leads to an unnecessary expenditure of computational resources on less critical operations, and a diminished Signal-to-Noise Ratio (SNR) when handling more complex tasks, significantly hindering the overall performance. Hence, we focus on the integration of CIM with Saliency-Aware Computing -- a paradigm that dynamically tailors computing precision based on the importance of each input. We propose On-the-fly Saliency-Aware Hybrid CIM (OSA-HCIM) offering three primary contributions: (1) On-the-fly Saliency-Aware (OSA) precision configuration scheme, which dynamically sets the precision of each MAC operation based on its saliency, (2) Hybrid CIM Array (HCIMA), which enables simultaneous operation of digital-domain CIM (DCIM) and analog-domain CIM (ACIM) via split-port 6T SRAM, and (3) an integrated framework combining OSA and HCIMA to fulfill diverse accuracy and power demands.
Implemented on a 65nm CMOS process, OSA-HCIM demonstrates an exceptional balance between accuracy and resource utilization. Notably, it is the first CIM design to incorporate a dynamic digital-to-analog boundary, providing unprecedented flexibility for saliency-aware computing. OSA-HCIM achieves a 1.95x enhancement in energy efficiency, while maintaining minimal accuracy loss compared to DCIM when tested on CIFAR100 dataset.
△ Less
Submitted 21 November, 2023; v1 submitted 29 August, 2023;
originally announced August 2023.
-
FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design
Authors:
Nobuho Hashimoto,
Shinya Takamaeda-Yamazaki
Abstract:
3D reconstruction from videos has become increasingly popular for various applications, including navigation for autonomous driving of robots and drones, augmented reality (AR), and 3D modeling. This task often combines traditional image/video processing algorithms and deep neural networks (DNNs). Although recent developments in deep learning have improved the accuracy of the task, the large numbe…
▽ More
3D reconstruction from videos has become increasingly popular for various applications, including navigation for autonomous driving of robots and drones, augmented reality (AR), and 3D modeling. This task often combines traditional image/video processing algorithms and deep neural networks (DNNs). Although recent developments in deep learning have improved the accuracy of the task, the large number of calculations involved results in low computation speed and high power consumption. Although there are various domain-specific hardware accelerators for DNNs, it is not easy to accelerate the entire process of applications that alternate between traditional image/video processing algorithms and DNNs. Thus, FPGA-based end-to-end acceleration is required for such complicated applications in low-power embedded environments.
This paper proposes a novel FPGA-based accelerator for DeepVideoMVS, a DNN-based depth estimation method for 3D reconstruction. We employ HW/SW co-design to appropriately utilize heterogeneous components in modern SoC FPGAs, such as programmable logic (PL) and CPU, according to the inherent characteristics of the method. As some operations are unsuitable for hardware implementation, we determine the operations to be implemented in software through analyzing the number of times each operation is performed and its memory access pattern, and then considering comprehensive aspects: the ease of hardware implementation and degree of expected acceleration by hardware. The hardware and software implementations are executed in parallel on the PL and CPU to hide their execution latencies. The proposed accelerator was developed on a Xilinx ZCU104 board by using NNgen, an open-source high-level synthesis (HLS) tool. Experiments showed that the proposed accelerator operates 60.2 times faster than the software-only implementation on the same FPGA board with minimal accuracy degradation.
△ Less
Submitted 16 December, 2022; v1 submitted 1 December, 2022;
originally announced December 2022.
-
MP-GELU Bayesian Neural Networks: Moment Propagation by GELU Nonlinearity
Authors:
Yuki Hirayama,
Sinya Takamaeda-Yamazaki
Abstract:
Bayesian neural networks (BNNs) have been an important framework in the study of uncertainty quantification. Deterministic variational inference, one of the inference methods, utilizes moment propagation to compute the predictive distributions and objective functions. Unfortunately, deriving the moments requires computationally expensive Taylor expansion in nonlinear functions, such as a rectified…
▽ More
Bayesian neural networks (BNNs) have been an important framework in the study of uncertainty quantification. Deterministic variational inference, one of the inference methods, utilizes moment propagation to compute the predictive distributions and objective functions. Unfortunately, deriving the moments requires computationally expensive Taylor expansion in nonlinear functions, such as a rectified linear unit (ReLU) or a sigmoid function. Therefore, a new nonlinear function that realizes faster moment propagation than conventional functions is required. In this paper, we propose a novel nonlinear function named moment propagating-Gaussian error linear unit (MP-GELU) that enables the fast derivation of first and second moments in BNNs. MP-GELU enables the analytical computation of moments by applying nonlinearity to the input statistics, thereby reducing the computationally expensive calculations required for nonlinear functions. In empirical experiments on regression tasks, we observed that the proposed MP-GELU provides higher prediction accuracy and better quality of uncertainty with faster execution than those of ReLU-based BNNs.
△ Less
Submitted 23 November, 2022;
originally announced November 2022.
-
An FPGA-Based Fully Pipelined Bilateral Grid for Real-Time Image Denoising
Authors:
Nobuho Hashimoto,
Shinya Takamaeda-Yamazaki
Abstract:
The bilateral filter (BF) is widely used in image processing because it can perform denoising while preserving edges. It has disadvantages in that it is nonlinear, and its computational complexity and hardware resources are directly proportional to its window size. Thus far, several approximation methods and hardware implementations have been proposed to solve these problems. However, processing l…
▽ More
The bilateral filter (BF) is widely used in image processing because it can perform denoising while preserving edges. It has disadvantages in that it is nonlinear, and its computational complexity and hardware resources are directly proportional to its window size. Thus far, several approximation methods and hardware implementations have been proposed to solve these problems. However, processing large-scale and high-resolution images in real time under severe hardware resource constraints remains a challenge.
This paper proposes a real-time image denoising system that uses an FPGA based on the bilateral grid (BG). In the BG, a 2D image consisting of x- and y-axes is projected onto a 3D space called a "grid," which consists of axes that correlate to the x-component, y-component, and intensity value of the input image. This grid is then blurred using the Gaussian filter, and the output image is generated by interpolating the grid. Although it is possible to change the window size in the BF, it is impossible to change it on the input image in the BG. This makes it difficult to associate the BG with the BF and to obtain the property of suppressing the increase in hardware resources when the window radius is enlarged.
This study demonstrates that a BG with a variable-sized window can be realized by introducing the window radius parameter wherein the window radius on the grid is always 1. We then implement this BG on an FPGA in a fully pipelined manner. Further, we verify that our design suppresses the increase in hardware resources even when the window size is enlarged and outperforms the existing designs in terms of computation speed and hardware resources.
△ Less
Submitted 13 December, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.