AES-RV: Hardware-Efficient RISC-V Accelerator with Low-Latency AES Instruction Extension for IoT Security

Authors: Van Tinh Nguyen, Phuc Hung Pham, Vu Trung Duong Le, Hoai Luan Pham, Tuan Hai Vu, Thi Diem Tran

Abstract: The Advanced Encryption Standard (AES) is a widely adopted cryptographic algorithm essential for securing embedded systems and IoT platforms. However, existing AES hardware accelerators often face limitations in performance, energy efficiency, and flexibility. This paper presents AES-RV, a hardware-efficient RISC-V accelerator featuring low-latency AES instruction extensions optimized for real-time processing across all AES modes and key sizes. AES-RV integrates three key innovations: high-bandwidth internal buffers for continuous data processing, a specialized AES unit with custom low-latency instructions, and a pipelined system supported by a ping-pong memory transfer mechanism. Implemented on the Xilinx ZCU102 SoC FPGA, AES-RV achieves up to 255.97 times speedup and up to 453.04 times higher energy efficiency compared to baseline and conventional CPU/GPU platforms. It also demonstrates superior throughput and area efficiency against state-of-the-art AES accelerators, making it a strong candidate for secure and high-performance embedded systems.

Submitted 17 May, 2025; originally announced May 2025.

Comments: 6 pages, 5 figures. Submitted to IEICE Electronics Express

ACM Class: C.3; B.6.3; E.3

arXiv:2505.01984 [pdf, other]

Lifelong Whole Slide Image Analysis: Online Vision-Language Adaptation and Past-to-Present Gradient Distillation

Authors: Doanh C. Bui, Hoai Luan Pham, Vu Trung Duong Le, Tuan Hai Vu, Van Duy Tran, Khang Nguyen, Yasuhiko Nakashima

Abstract: Whole Slide Images (WSIs) play a crucial role in accurate cancer diagnosis and prognosis, as they provide tissue details at the cellular level. However, the rapid growth of computational tasks involving WSIs poses significant challenges. Given that WSIs are gigapixels in size, they present difficulties in terms of storage, processing, and model training. Therefore, it is essential to develop lifelong learning approaches for WSI analysis. In scenarios where slides are distributed across multiple institutes, we aim to leverage them to develop a unified online model as a computational tool for cancer diagnosis in clinical and hospital settings. In this study, we introduce ADaFGrad, a method designed to enhance lifelong learning for whole-slide image (WSI) analysis. First, we leverage pathology vision-language foundation models to develop a framework that enables interaction between a slide's regional tissue features and a predefined text-based prototype buffer. Additionally, we propose a gradient-distillation mechanism that mimics the gradient of a logit with respect to the classification-head parameters across past and current iterations in a continual-learning setting. We construct a sequence of six TCGA datasets for training and evaluation. Experimental results show that ADaFGrad outperforms both state-of-the-art WSI-specific and conventional continual-learning methods after only a few training epochs, exceeding them by up to +5.068% in the class-incremental learning scenario while exhibiting the least forgetting (i.e., retaining the most knowledge from previous tasks). Moreover, ADaFGrad surpasses its baseline by as much as +40.084% in accuracy, further demonstrating the effectiveness of the proposed modules.

Submitted 4 May, 2025; originally announced May 2025.

arXiv:2504.15627 [pdf, ps, other]

ZeroSlide: Is Zero-Shot Classification Adequate for Lifelong Learning in Whole-Slide Image Analysis in the Era of Pathology Vision-Language Foundation Models?

Authors: Doanh C. Bui, Hoai Luan Pham, Vu Trung Duong Le, Tuan Hai Vu, Van Duy Tran, Yasuhiko Nakashima

Abstract: Lifelong learning for whole slide images (WSIs) poses the challenge of training a unified model to perform multiple WSI-related tasks, such as cancer subtyping and tumor classification, in a distributed, continual fashion. This is a practical and applicable problem in clinics and hospitals, as WSIs are large and costly to store, process, and transfer. Training new models whenever new tasks are defined is time-consuming. Recent work has applied regularization- and rehearsal-based methods to this setting. However, the rise of vision-language foundation models that align diagnostic text with pathology images raises the question: are these models alone sufficient for lifelong WSI learning using zero-shot classification, or is further investigation into continual learning strategies needed to improve performance? To our knowledge, this is the first study to compare conventional continual-learning approaches with vision-language zero-shot classification for WSIs. Our source code and experimental results will be available soon.

Submitted 22 April, 2025; originally announced April 2025.

Comments: 10 pages, 3 figures, 1 table, conference submission

arXiv:2503.14951 [pdf, other]

QEA: An Accelerator for Quantum Circuit Simulation with Resources Efficiency and Flexibility

Authors: Van Duy Tran, Tuan Hai Vu, Vu Trung Duong Le, Hoai Luan Pham, Yasuhiko Nakashima

Abstract: The area of quantum circuit simulation has attracted a lot of attention in recent years. However, due to the exponentially increasing computational costs, assessing and validating these models on large datasets poses significant obstacles. Despite plenty of research in quantum simulation, issues such as memory management, system adaptability, and execution efficiency remain unresolved. In this study, we introduce QEA, a state vector-based hardware accelerator that overcomes these difficulties with four key improvements: optimized memory allocation management, open PE, flexible ALU, and simplified CX swapper. To evaluate QEA's capabilities, we implemented and evaluated it on the AMD Alveo U280 board, where it uses only 0.534 W of power. Experimental results show that QEA is extremely flexible, supporting a wide range of quantum circuits; has excellent fidelity, making it appropriate for standard quantum emulators; and outperforms powerful CPUs and related works by up to 153.16x in terms of normalized gate speed. This study has considerable potential as a useful approach for quantum emulators in future works.

Submitted 12 May, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

Comments: 6 pages, 7 figures. This work has been accepted at the 10th International Conference on ICs, Design and Verification (ICDV 2025). The code (software-side) is available at https://github.com/NAIST-Archlab/fast-psr

Journal ref: Proc. of the 10th International Conference on ICs, Design and Verification (ICDV 2025), June 16-17, 2025, Ho Chi Minh City, Vietnam

arXiv:2411.04471 [pdf, other]

FQsun: A Configurable Wave Function-Based Quantum Emulator for Power-Efficient Quantum Simulations

Authors: Tuan Hai Vu, Vu Trung Duong Le, Hoai Luan Pham, Quoc Chuong Nguyen, Yasuhiko Nakashima

Abstract: Quantum computers are promising, powerful machines for solving complex problems, but access to real quantum hardware remains limited due to high costs. Although software simulators on CPUs/GPUs such as Qiskit, ProjectQ, and Qsun offer flexibility and support for many qubits, they struggle with high power consumption and limited processing speed, especially as qubit counts scale. Accordingly, quantum emulators implemented on dedicated hardware, such as FPGAs and analog circuits, offer a promising path for addressing energy efficiency concerns. However, existing studies on hardware-based emulators still face challenges in terms of limited flexibility and lack of fidelity evaluation. To overcome these gaps, we propose FQsun, a quantum emulator that enhances performance by integrating four key innovations: efficient memory organization, a configurable Quantum Gate Unit (QGU), optimized scheduling, and multiple number precisions. Five FQsun versions with different number precisions are implemented on the Xilinx ZCU102, consuming a maximum power of 2.41 W. Experimental results demonstrate high fidelity, low mean square error, and high normalized gate speed, particularly with 32-bit versions, establishing FQsun's capability as a precise quantum emulator. Benchmarking on well-known quantum algorithms reveals that FQsun achieves a superior power-delay product, outperforming software simulators on CPUs in the processing speed range.

Submitted 18 March, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

Comments: 15 pages, 11 figures, 7 tables, submitted to the IEEE Access

arXiv:2410.11146 [pdf, other]

Theoretical Analysis of the Efficient-Memory Matrix Storage Method for Quantum Emulation Accelerators with Gate Fusion on FPGAs

Authors: Tran Xuan Hieu Le, Hoai Luan Pham, Tuan Hai Vu, Vu Trung Duong Le, Nakashima Yasuhiko

Abstract: Quantum emulators play an important role in the development and testing of quantum algorithms, especially given the limitations of the current FTQC era. Developing high-speed, memory-optimized quantum emulators is a growing research trend, with gate fusion being a promising technique. However, existing gate fusion implementations often struggle to efficiently support large-scale quantum systems with a high number of qubits due to a lack of optimizations for the exponential growth in memory requirements. Therefore, this study proposes the EMMS (Efficient-Memory Matrix Storage) method for storing quantum operators and states, along with an EMMS-based Quantum Emulator Accelerator (QEA) architecture that incorporates multiple processing elements (PEs) to accelerate tensor product and matrix multiplication computations in quantum emulation with gate fusion. The theoretical analysis of the QEA on the Xilinx ZCU102 FPGA, using varying numbers of PEs and different depths of unitary and local data memory, reveals a linear increase in memory depth with the number of qubits. This scaling highlights the potential of the EMMS-based QEA to accommodate larger quantum circuits, providing insights into selecting appropriate memory sizes and FPGA devices. Furthermore, the estimated performance of the QEA with PE counts ranging from $2^2$ to $2^5$ on the Xilinx ZCU102 FPGA demonstrates that increasing the number of PEs significantly reduces the computation cycle count for circuits with fewer than 18 qubits, making it significantly faster than previous works.

Submitted 14 October, 2024; originally announced October 2024.

arXiv:2407.17790 [pdf, other]

Exploring the Limitations of Kolmogorov-Arnold Networks in Classification: Insights to Software Training and Hardware Implementation

Authors: Van Duy Tran, Tran Xuan Hieu Le, Thi Diem Tran, Hoai Luan Pham, Vu Trung Duong Le, Tuan Hai Vu, Van Tinh Nguyen, Yasuhiko Nakashima

Abstract: Kolmogorov-Arnold Networks (KANs), a novel type of neural network, have recently gained popularity and attention due to their ability to substitute for multi-layer perceptrons (MLPs) in artificial intelligence (AI) with higher accuracy and interpretability. However, KAN assessment is still limited and cannot provide an in-depth analysis of a specific domain. Furthermore, no study has been conducted on the implementation of KANs in hardware design, which would directly demonstrate whether KANs are truly superior to MLPs in practical applications. As a result, in this paper, we focus on verifying KANs for classification problems, a common but significant topic in AI, using four different types of datasets. Furthermore, the corresponding hardware implementation is considered using the Vitis high-level synthesis (HLS) tool. To the best of our knowledge, this is the first article to implement hardware for KAN. The results indicate that KANs cannot achieve higher accuracy than MLPs on highly complex datasets while utilizing substantially higher hardware resources. Therefore, MLP remains an effective approach for achieving accuracy and efficiency in software and hardware implementation.

Submitted 25 July, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

Comments: 6 pages, 3 figures, 2 tables

Showing 1–7 of 7 results for author: Le, V T D